In [1]:
%autosave 0
from IPython.display import HTML, display
display(HTML('<style>.container { width:100%; } </style>'))


Autosave disabled

Spam Detection Using the Naive Bayes Algorithm

The process of creating a spam detector using the naive Bayes algorithm is split up into four steps.

  • Create a set of the most common words occurring in spam and ham (i.e. non-spam) emails.
  • For every word occurring in this set, compute the conditional probability that this word occurs in a spam or ham email.
  • Create a function that takes an email and the conditional probabilities computed before and that then computes the probability that the given email is spam.
  • Evaluate the precision and the recall of the spam classifier.
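
The first three steps can be sketched on a tiny in-memory data set. This is only a minimal illustration, not the implementation developed in this notebook: the toy emails, the helper names `train` and `spam_probability`, and the use of Laplace smoothing are assumptions made for the example. Step 4 is omitted because the toy data has no separate test set.

```python
import math
from collections import Counter

# Toy data (hypothetical): each email is represented by its set of words.
spam_mails = [{'win', 'money', 'now'}, {'free', 'money'}, {'win', 'free', 'prize'}]
ham_mails  = [{'meeting', 'now'}, {'paper', 'review'}, {'meeting', 'paper', 'money'}]

def train(mails, vocabulary):
    """Estimate P(word occurs | class) with Laplace smoothing (an assumption)."""
    counts = Counter()
    for mail in mails:
        counts.update(mail & vocabulary)
    return {w: (counts[w] + 1) / (len(mails) + 2) for w in vocabulary}

def spam_probability(mail, vocabulary, p_spam, p_ham, prior_spam=0.5):
    """Return P(spam | mail) under the naive independence assumption."""
    log_spam = math.log(prior_spam)
    log_ham  = math.log(1 - prior_spam)
    for w in vocabulary:
        if w in mail:
            log_spam += math.log(p_spam[w])
            log_ham  += math.log(p_ham[w])
        else:
            log_spam += math.log(1 - p_spam[w])
            log_ham  += math.log(1 - p_ham[w])
    # Convert the two log scores back into a normalized probability.
    return 1 / (1 + math.exp(log_ham - log_spam))

# Step 1: the vocabulary (here simply all words instead of the most common ones).
vocabulary = set().union(*spam_mails, *ham_mails)
# Step 2: conditional probabilities for each class.
p_spam = train(spam_mails, vocabulary)
p_ham  = train(ham_mails,  vocabulary)
# Step 3: compute the spam probability of two unseen emails.
prob_a = spam_probability({'free', 'money', 'prize'}, vocabulary, p_spam, p_ham)
prob_b = spam_probability({'meeting', 'paper'}, vocabulary, p_spam, p_ham)
print(prob_a, prob_b)
```

Working with logarithms avoids numerical underflow when many small probabilities are multiplied, which matters once the vocabulary contains thousands of words.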

Step 1: Create Word Dictionary

We need the module `os` for reading directories and the module `re` for regular expressions. We also import `numpy` and `math`.


In [2]:
import os
import re
import numpy as np
import math

An object of class `Counter` is a special form of a dictionary that is used for counting. We need a counter to figure out what the most common words are.


In [3]:
from collections import Counter
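
As a brief illustration of how a `Counter` behaves, the following sketch counts words in a small list. The word list is made up for the example.

```python
from collections import Counter

# Hypothetical example: count how often each word occurs in a list.
counter = Counter(['free', 'money', 'free', 'paper', 'free', 'money'])

# update() adds further observations to the existing counts.
counter.update(['free', 'review'])

# most_common(n) returns the n entries with the highest counts as (word, count) pairs.
top_two = counter.most_common(2)
print(top_two)

# A word that has never been counted has the count 0 instead of raising a KeyError.
print(counter['unknown'])
```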

The directory https://github.com/karlstroetmann/Artificial-Intelligence/tree/master/Python/EmailData contains 960 emails that are divided into four subdirectories:

  • spam-train contains 350 spam emails for training,
  • ham-train contains 350 non-spam emails for training,
  • spam-test contains 130 spam emails for testing,
  • ham-test contains 130 non-spam emails for testing.

This data was originally collected by Ion Androutsopoulos. I found it on the page http://openclassroom.stanford.edu/MainFolder/DocumentPage.php?course=MachineLearning&doc=exercises/ex6/ex6.html provided by Andrew Ng.

We declare some variables so this notebook can be adapted to other data sets.


In [4]:
spam_dir_train = 'EmailData/spam-train/'
ham__dir_train = 'EmailData/ham-train/'
spam_dir_test  = 'EmailData/spam-test/'
ham__dir_test  = 'EmailData/ham-test/'
Directories    = [spam_dir_train, ham__dir_train, spam_dir_test, ham__dir_test]

In order to compute the prior probability that an email is ham or spam we need to count the number of spam and ham emails.


In [5]:
no_spam    = len(os.listdir(spam_dir_train))
no_ham     = len(os.listdir(ham__dir_train))
spam_prior = no_spam / (no_spam + no_ham)
ham__prior = no_ham  / (no_spam + no_ham)
spam_prior, ham__prior


Out[5]:
(0.5, 0.5)

I have checked that the proportion of spam and ham emails in the test directory is also $1:1$. If the proportion of spam and ham emails in real life differs from $1:1$, we would have to use the real proportion as the prior in the spam filter to be developed.
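
To illustrate how a prior other than $1:1$ would change the result, here is a minimal sketch based on Bayes' theorem. The function name `posterior_spam` and all numbers are hypothetical and only serve the example.

```python
def posterior_spam(likelihood_spam, likelihood_ham, spam_prior):
    """Bayes' theorem: P(spam | email) from the class likelihoods and the prior."""
    ham_prior = 1 - spam_prior
    evidence  = likelihood_spam * spam_prior + likelihood_ham * ham_prior
    return likelihood_spam * spam_prior / evidence

# Hypothetical likelihoods P(email | spam) and P(email | ham) for one email.
l_spam, l_ham = 0.03, 0.01

balanced = posterior_spam(l_spam, l_ham, spam_prior=0.5)  # 1:1, as in this data set
skewed   = posterior_spam(l_spam, l_ham, spam_prior=0.1)  # assumed ratio spam:ham of 1:9
print(balanced, skewed)
```

With these made-up numbers the same email is classified as probable spam under the balanced prior but as probable ham when spam is assumed to be rare.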

The function $\texttt{get_words}(\texttt{fn})$ takes a filename $\texttt{fn}$ as its argument. It reads the file and returns a set of all words that are found in this file. The words are transformed to lower case.


In [6]:
def get_words(fn):
    # The with statement ensures that the file is closed after reading.
    with open(fn) as file:
        text = file.read()
    text = text.lower()
    return set(re.findall(r"[\w']+", text))

Let us test this function with a small example mail.


In [7]:
get_words('EmailData/ham-train/3-380msg4.txt')


Out[7]:
{'anyone',
 'article',
 'berkeley',
 'book',
 'consonant',
 'edu',
 'english',
 'garnet',
 'hard',
 'helpful',
 'hi',
 'interest',
 'irish',
 'laurel',
 'm',
 'modern',
 'palatal',
 'phonetics',
 'posting',
 'project',
 'recommend',
 'slender',
 'source',
 'specifically',
 'sutton',
 'thank',
 'too',
 'work'}
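
To see what the pattern `[\w']+` extracts, it can also be applied directly to a string. The sample text below is made up for this illustration.

```python
import re

# Hypothetical sample text; the pattern is the same as in get_words.
text  = "Hi, I'm here! Don't reply to SPAM_BOT@example.com."
words = set(re.findall(r"[\w']+", text.lower()))
print(sorted(words))
```

The pattern keeps letters, digits, underscores, and apostrophes together, so `i'm` and `spam_bot` each form a single word, while punctuation such as `@` and `.` splits the email address into separate words.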

The function read_all_files reads all files contained in those directories that are stored in the list Directories. It returns a Counter. For every word $w$ this counter contains the number of files that contain $w$.


In [8]:
def read_all_files():
    Words = Counter()
    for directory in Directories:
        for file_name in os.listdir(directory):
            Words.update(get_words(directory + file_name))
    return Words
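
Because the counter is updated with a set of words for each file, it counts the number of files containing a word and not the total number of occurrences. The following sketch checks this on two temporary files. The function `count_documents` is a hypothetical variant of `read_all_files` that takes the directories as a parameter, and the file names and contents are made up.

```python
import os
import re
import tempfile
from collections import Counter

def count_documents(directories):
    """Hypothetical variant of read_all_files taking the directories as a parameter."""
    words = Counter()
    for directory in directories:
        for file_name in os.listdir(directory):
            with open(os.path.join(directory, file_name)) as file:
                # A set contains each word once, so every file adds at most 1 per word.
                words.update(set(re.findall(r"[\w']+", file.read().lower())))
    return words

with tempfile.TemporaryDirectory() as tmp:
    # Create two small example files in a temporary directory.
    for name, text in [('a.txt', 'free free money'), ('b.txt', 'free paper')]:
        with open(os.path.join(tmp, name), 'w') as file:
            file.write(text)
    doc_freq = count_documents([tmp])
print(doc_freq)
```

Although `free` occurs three times in total, its count is 2 because it appears in two files.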

Common_Words is the set of the 2500 most common words found in all of our emails. To compute it, we first count for every word the number of emails in which it occurs.


In [9]:
N            = 2500             # number of the most common words to use
Word_Counter = read_all_files()
Word_Counter


Out[9]:
Counter({'eminent': 9,
         'earn': 69,
         'experience': 123,
         'through': 155,
         'phd': 22,
         'prestige': 9,
         'increase': 69,
         'grant': 23,
         'effort': 75,
         'mba': 8,
         'choice': 51,
         'here': 259,
         'short': 86,
         'field': 117,
         'part': 131,
         'personal': 102,
         'programs': 21,
         'base': 134,
         'ba': 13,
         'phone': 202,
         'power': 52,
         'necessary': 55,
         'degree': 41,
         'further': 154,
         'detail': 143,
         'call': 347,
         'advance': 81,
         'require': 131,
         'nonaccredit': 8,
         'award': 20,
         'present': 142,
         'knowledge': 72,
         'money': 187,
         'university': 307,
         'diploma': 10,
         'ma': 37,
         'cost': 147,
         'entire': 45,
         'conference': 138,
         'grab': 9,
         'week': 173,
         'receive': 283,
         'start': 173,
         'leverage': 5,
         'offence': 4,
         'our': 365,
         'delete': 59,
         'po': 53,
         'old': 83,
         'mailer': 20,
         'financial': 70,
         'member': 104,
         'problem': 128,
         'believe': 103,
         'ago': 65,
         'throw': 20,
         'customer': 69,
         'hello': 54,
         'letter': 106,
         'inexpensive': 24,
         'guarantee': 100,
         'ignore': 42,
         'complete': 119,
         'control': 53,
         'outside': 43,
         'cash': 91,
         'name': 289,
         'usa': 122,
         'state': 220,
         'pardon': 9,
         'texa': 35,
         'cst': 5,
         'reside': 3,
         'send': 360,
         'lifeline': 1,
         'later': 81,
         'without': 122,
         'print': 107,
         'program': 226,
         'honestly': 6,
         'best': 206,
         'nobrainer': 1,
         'one': 404,
         'note': 148,
         'free': 302,
         'show': 161,
         'computer': 152,
         'credit': 103,
         'registration': 86,
         'must': 181,
         'grapevine': 1,
         'process': 161,
         'center': 60,
         'today': 179,
         'weekly': 35,
         'mind': 62,
         'zip': 75,
         'interest': 283,
         'compound': 12,
         'few': 128,
         'address': 379,
         'simple': 111,
         'telephone': 91,
         'educational': 22,
         'main': 72,
         'worth': 48,
         'entitle': 13,
         'convert': 12,
         'plan': 88,
         's': 560,
         'message': 189,
         'join': 95,
         'number': 248,
         'respond': 45,
         'box': 124,
         'achieve': 42,
         'card': 112,
         'life': 99,
         'solution': 28,
         'mortgage': 18,
         'please': 445,
         'city': 120,
         'information': 448,
         'especially': 74,
         'net': 100,
         'id': 34,
         'participate': 63,
         'us': 308,
         'pull': 8,
         'independence': 14,
         'tuesday': 21,
         'enable': 26,
         'company': 139,
         'over': 250,
         'simply': 123,
         'night': 39,
         'pm': 42,
         'finances': 2,
         'intrusion': 18,
         'return': 103,
         'solid': 15,
         'establish': 35,
         'mean': 81,
         'freedom': 47,
         'peace': 7,
         'form': 210,
         'begin': 69,
         'system': 171,
         'debt': 40,
         'obtain': 41,
         'secure': 32,
         'per': 141,
         'pack': 15,
         'cozy': 1,
         'oct': 6,
         'vacation': 37,
         'west': 26,
         'archery': 1,
         'felton': 1,
         'pay': 149,
         'e': 294,
         'home': 161,
         'accomodation': 10,
         'virginium': 9,
         'turkey': 3,
         'deer': 1,
         'loader': 6,
         'wonderful': 12,
         'sesson': 1,
         'cook': 3,
         'economical': 3,
         'meal': 6,
         'buck': 12,
         'room': 55,
         'mail': 350,
         'reserve': 28,
         'stay': 32,
         'noon': 5,
         'nov': 3,
         'muzzel': 1,
         'hunt': 3,
         'season': 10,
         'announce': 71,
         'want': 231,
         'follow': 320,
         'space': 44,
         'wood': 3,
         'com': 257,
         'compuserve': 26,
         'day': 244,
         'dec': 7,
         'wild': 7,
         'lunch': 44,
         'book': 145,
         'camp': 6,
         'three': 103,
         'doe': 29,
         'additional': 110,
         'million': 111,
         'wi': 1,
         'reach': 72,
         'commercial': 37,
         'info': 71,
         'future': 116,
         'success': 82,
         'nettool': 1,
         'fingertip': 9,
         'internet': 188,
         'software': 119,
         'network': 43,
         'search': 88,
         'permanently': 12,
         'area': 161,
         'evaluation': 39,
         'proper': 23,
         'requirement': 43,
         'presence': 26,
         'section': 74,
         'stop': 75,
         'regard': 60,
         'propose': 46,
         'web': 211,
         'advantage': 67,
         'sender': 27,
         'certain': 47,
         'help': 164,
         'remove': 203,
         'storefront': 2,
         'target': 36,
         'product': 137,
         'fellow': 22,
         'promote': 38,
         'luck': 34,
         'basis': 64,
         'request': 157,
         'loc': 2,
         'comply': 24,
         'recent': 65,
         'lead': 63,
         'mailing': 71,
         'bill': 84,
         'selection': 38,
         'c': 174,
         'ooo': 1,
         'waterford': 1,
         'reply': 131,
         'ten': 35,
         'paragraph': 13,
         'post': 113,
         'unite': 61,
         'transmission': 13,
         'gov': 30,
         'http': 399,
         'entrepreneur': 14,
         'subject': 192,
         'tool': 70,
         'service': 171,
         'dear': 70,
         'business': 164,
         'assist': 24,
         'level': 107,
         'need': 250,
         'sale': 74,
         'thoma': 16,
         'item': 40,
         'unbelievable': 9,
         'much': 190,
         'try': 125,
         'set': 105,
         'wish': 142,
         'thank': 182,
         'market': 156,
         'email': 429,
         'vast': 5,
         'online': 126,
         'venture': 7,
         'federal': 36,
         'audience': 16,
         'unwise': 1,
         'check': 210,
         'greatest': 35,
         'unmissable': 1,
         're': 198,
         'titanictesco': 1,
         'park': 32,
         'fame': 1,
         'onto': 10,
         'release': 45,
         'include': 354,
         'player': 13,
         'visit': 138,
         'ultimate': 15,
         'refreshment': 1,
         'stack': 7,
         'gossip': 6,
         'shop': 35,
         'while': 116,
         'chart': 10,
         'cd': 63,
         'never': 120,
         'unlikely': 4,
         'package': 73,
         'alway': 91,
         'www': 296,
         'undead': 1,
         'band': 12,
         'why': 118,
         'billy': 2,
         'event': 50,
         'full': 143,
         'right': 144,
         'digital': 33,
         'delay': 16,
         'yourself': 90,
         'late': 30,
         'friend': 90,
         'easy': 125,
         'available': 254,
         'beautiful': 32,
         'placebo': 3,
         'chance': 63,
         'fantastic': 35,
         'top': 77,
         'pick': 44,
         'mtv': 1,
         'glamour': 2,
         'run': 81,
         'access': 98,
         'john': 96,
         'competition': 40,
         'click': 135,
         'offer': 229,
         'compaq': 5,
         'n': 78,
         'pop': 22,
         'roll': 24,
         'scoop': 6,
         'dizzy': 1,
         'premiere': 4,
         'big': 63,
         'sound': 76,
         'bathtub': 2,
         'reporter': 6,
         'crash': 7,
         'witch': 7,
         'radio': 28,
         'tesco': 1,
         'portrait': 1,
         'drink': 6,
         'milan': 4,
         'down': 96,
         'atmosphere': 4,
         'play': 64,
         'provide': 203,
         'london': 40,
         'thing': 109,
         'aqua': 3,
         'crazy': 6,
         'fun': 66,
         'tale': 3,
         'site': 218,
         'record': 65,
         'spellbind': 1,
         'prepare': 44,
         'nt': 222,
         'true': 78,
         'leicester': 3,
         'unsubscribe': 23,
         'glitz': 2,
         'b': 118,
         'technology': 84,
         'xpack': 1,
         'robbie': 7,
         'emma': 1,
         'fizzy': 1,
         'rem': 4,
         'icon': 6,
         'miss': 66,
         'exclusive': 39,
         'capitalfm': 23,
         'hit': 54,
         'spook': 1,
         'thursday': 26,
         'save': 108,
         'straight': 13,
         'choose': 75,
         'question': 221,
         'rock': 11,
         'star': 27,
         'music': 29,
         'europe': 41,
         'halloween': 1,
         'bumper': 1,
         'hesitate': 39,
         'accelerate': 5,
         'graphic': 29,
         'storm': 10,
         'horror': 1,
         'instant': 16,
         'supply': 23,
         'special': 162,
         'spin': 6,
         'prizewin': 1,
         'll': 147,
         'regular': 41,
         'hurry': 12,
         'many': 244,
         'even': 194,
         'colors': 1,
         'reveal': 18,
         'celine': 3,
         'ghost': 1,
         'too': 85,
         'attend': 30,
         've': 124,
         'website': 81,
         'starstud': 1,
         'travolta': 1,
         'foyer': 2,
         'adulterous': 1,
         'list': 329,
         'classic': 12,
         'absolutely': 50,
         'south': 45,
         'enter': 74,
         'latest': 57,
         'doorstep': 5,
         'pc': 37,
         'prize': 25,
         'label': 21,
         'roundup': 3,
         'connolly': 1,
         'dion': 3,
         'tell': 130,
         'megastar': 1,
         'fill': 88,
         'desktop': 7,
         'presario': 4,
         'dolby': 4,
         'nail': 3,
         'win': 120,
         'paradise': 14,
         'stock': 31,
         'thompson': 10,
         'scary': 7,
         'titanic': 1,
         'couple': 38,
         'guess': 17,
         'discount': 30,
         'flick': 1,
         'u': 123,
         'entirely': 11,
         'amaze': 52,
         'link': 96,
         'advertisement': 54,
         'better': 106,
         'william': 34,
         'feel': 62,
         'become': 90,
         'spooky': 1,
         'album': 16,
         'game': 46,
         'still': 99,
         'manufacturer': 13,
         'buy': 111,
         'primary': 23,
         'bring': 102,
         'screen': 25,
         'president': 18,
         'biz': 10,
         'coolest': 3,
         'surround': 12,
         'poster': 24,
         'everythe': 21,
         'fm': 15,
         'focus': 78,
         'talk': 92,
         'team': 32,
         'jimmy': 2,
         'mailbox': 27,
         'cdparadise': 4,
         'next': 121,
         'catch': 18,
         'favourite': 13,
         'world': 183,
         'saint': 5,
         'laugh': 19,
         'up': 1,
         'whether': 85,
         'performance': 22,
         'bunch': 6,
         'hot': 45,
         'bath': 3,
         'head': 44,
         'fantasy': 7,
         'square': 6,
         'capital': 58,
         'movie': 32,
         'major': 123,
         'submission': 95,
         'hrs': 3,
         'resubmit': 2,
         'meta': 3,
         'automatically': 39,
         'report': 129,
         'calle': 4,
         'notice': 40,
         'engine': 50,
         'compose': 6,
         'fees': 8,
         'within': 176,
         'advertiser': 16,
         'bulk': 79,
         'after': 163,
         'each': 201,
         'etc': 150,
         'every': 171,
         'appropriate': 33,
         'page': 181,
         'toll': 50,
         'monthly': 39,
         'pro': 19,
         'hr': 18,
         'extractor': 14,
         'block': 30,
         'month': 131,
         'review': 87,
         'trie': 4,
         'submit': 104,
         'media': 16,
         'tag': 11,
         'thousands': 12,
         'solve': 12,
         'helps': 1,
         'reg': 15,
         'dollar': 98,
         'something': 67,
         'gotta': 5,
         'wasus': 2,
         'spam': 31,
         'safeaddress': 1,
         'idc': 2,
         'discreet': 5,
         'powerful': 39,
         'quickly': 37,
         'exceptions': 4,
         'community': 45,
         't': 146,
         'high': 88,
         'literally': 13,
         'general': 107,
         'along': 78,
         'travel': 71,
         'ask': 137,
         'benefit': 38,
         'oversea': 14,
         'paper': 176,
         'finance': 17,
         'soundest': 4,
         'promise': 37,
         'legally': 14,
         'amount': 101,
         'extract': 29,
         'clearly': 36,
         'confirm': 27,
         'certainly': 25,
         'espouse': 5,
         'upon': 51,
         'contract': 19,
         'beverly': 2,
         'word': 191,
         'extra': 64,
         'nbc': 4,
         'thousand': 99,
         'means': 50,
         'curency': 2,
         'work': 304,
         'soon': 95,
         'monitor': 13,
         'before': 190,
         'ending': 2,
         'themselve': 51,
         'transact': 4,
         'vary': 24,
         'tran': 7,
         'march': 62,
         'transaction': 9,
         'move': 71,
         'ca': 111,
         'under': 109,
         'exactly': 65,
         'kid': 23,
         'public': 41,
         'bl': 4,
         'nightly': 7,
         'view': 66,
         'greatly': 16,
         'earlier': 29,
         'contact': 203,
         'likewise': 13,
         'currency': 17,
         'minute': 106,
         'wall': 17,
         'create': 89,
         'reason': 77,
         'daily': 40,
         'yet': 50,
         'effect': 46,
         'editorial': 15,
         'santa': 14,
         'optional': 26,
         'conversion': 7,
         'flaw': 6,
         'back': 139,
         'completely': 65,
         'end': 105,
         'amass': 4,
         'individual': 75,
         'operate': 37,
         'organization': 50,
         'however': 97,
         'watch': 64,
         'someone': 86,
         'rate': 101,
         'iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiius': 2,
         'wealth': 16,
         'fortune': 26,
         'own': 170,
         'wealthiest': 6,
         'cartel': 4,
         'explosive': 8,
         'political': 23,
         'membership': 30,
         'corner': 14,
         'national': 53,
         'change': 137,
         'hemisphere': 4,
         'payable': 59,
         'mllionaire': 2,
         'attache': 2,
         'dollars': 21,
         'write': 160,
         'o': 120,
         'overnight': 34,
         'anniversarry': 2,
         'let': 121,
         'group': 104,
         'first': 274,
         'assure': 16,
         'rumble': 4,
         'profile': 14,
         'same': 154,
         'attention': 60,
         'iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiius': 2,
         'publication': 87,
         'continue': 64,
         'postage': 23,
         'else': 87,
         'gold': 28,
         'instruction': 102,
         'nor': 42,
         'cold': 9,
         'm': 209,
         'int': 6,
         'fee': 94,
         'most': 232,
         'date': 129,
         'different': 161,
         'announcement': 47,
         'concern': 64,
         'glad': 15,
         'unlike': 16,
         'earth': 33,
         'guise': 6,
         'able': 99,
         'parent': 10,
         'easily': 63,
         'anyone': 130,
         'add': 122,
         'york': 68,
         'depend': 39,
         'long': 83,
         'ourself': 2,
         'allow': 112,
         'action': 61,
         'pertinent': 5,
         'below': 213,
         'street': 65,
         'exist': 63,
         'operation': 23,
         'legal': 67,
         'advice': 18,
         'monica': 4,
         'extremely': 47,
         'disclose': 6,
         'leave': 103,
         'cancel': 13,
         'important': 110,
         'californium': 52,
         'lessly': 2,
         'refund': 34,
         'american': 88,
         'uniform': 7,
         'document': 38,
         'confidential': 16,
         'supporter': 6,
         'hand': 89,
         'read': 167,
         'conclude': 22,
         'reiterate': 4,
         'keep': 120,
         'grow': 52,
         'until': 86,
         'surely': 28,
         'hi': 34,
         'secret': 55,
         'global': 33,
         'unlimit': 34,
         'administrative': 8,
         'profit': 63,
         'enquiry': 13,
         'divulge': 4,
         'don': 59,
         'great': 141,
         'line': 156,
         'learn': 124,
         'ship': 60,
         'immediately': 84,
         'those': 183,
         'instruct': 21,
         'limited': 23,
         'ourselve': 14,
         'worldwide': 44,
         'iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiius': 2,
         'excerpt': 8,
         'purpose': 40,
         'source': 67,
         'plus': 109,
         'again': 128,
         'office': 95,
         'left': 3,
         'school': 65,
         'low': 51,
         'hundred': 66,
         'envy': 4,
         'total': 79,
         'hills': 2,
         'd': 212,
         'recently': 56,
         'second': 117,
         'suite': 58,
         'exchange': 43,
         'share': 84,
         'method': 99,
         'fluctuate': 5,
         'differential': 6,
         'around': 67,
         'britney': 4,
         'tomorrow': 9,
         'tip': 29,
         'noisy': 2,
         'listen': 29,
         'loca': 1,
         'sneak': 5,
         'excite': 61,
         'peek': 6,
         'everybody': 16,
         'teacher': 25,
         'ride': 4,
         'beer': 11,
         'smash': 3,
         'calm': 2,
         'vonda': 1,
         'answer': 95,
         'gerus': 2,
         'century': 23,
         'ever': 120,
         'stereo': 1,
         'chat': 21,
         'californication': 1,
         'universal': 33,
         'channel': 19,
         'globe': 10,
         'zone': 11,
         'hottest': 30,
         'chilus': 1,
         'uk': 97,
         'celluloid': 1,
         'red': 11,
         'tvchannel': 3,
         'lines': 8,
         'song': 15,
         'wait': 71,
         'singles': 7,
         'fave': 1,
         'halliwell': 2,
         'past': 80,
         'terminator': 1,
         'whispering': 1,
         'tv': 26,
         'compzone': 10,
         'preacher': 3,
         'june': 60,
         'eurovision': 2,
         'man': 46,
         'forthcome': 12,
         'rd': 59,
         'piece': 42,
         'centrepiece': 1,
         'goss': 1,
         'playstation': 5,
         'break': 89,
         'newradioworld': 3,
         'bag': 10,
         'ricky': 2,
         'stress': 20,
         'fabulous': 21,
         'manic': 3,
         'itch': 1,
         'diary': 2,
         'backstreet': 2,
         'live': 114,
         'highlight': 13,
         'examiner': 1,
         'ad': 74,
         'recognise': 3,
         'entertain': 7,
         'martin': 21,
         'angele': 14,
         'studio': 8,
         'beverage': 10,
         'gear': 4,
         'st': 90,
         'saturday': 44,
         'dreamcast': 1,
         'somethe': 4,
         'webchat': 2,
         'co': 53,
         'delivery': 52,
         'summer': 30,
         'both': 176,
         'weekend': 20,
         'ring': 8,
         'th': 156,
         'despatch': 4,
         'foursome': 1,
         'preparation': 12,
         'madonna': 3,
         'panic': 2,
         'where': 198,
         'taylor': 8,
         'till': 4,
         'la': 42,
         'girls': 5,
         'professor': 46,
         'baz': 3,
         'livin': 1,
         'winning': 6,
         'luhrmann': 3,
         'revisionline': 1,
         'holiday': 25,
         'meet': 98,
         'lyric': 10,
         'shepard': 1,
         'boyzone': 6,
         'revision': 3,
         'video': 73,
         'size': 37,
         'nd': 59,
         'bargain': 16,
         'goodie': 2,
         'precious': 3,
         'vote': 15,
         'braless': 1,
         'prof': 34,
         'ticket': 37,
         'feature': 90,
         'prior': 37,
         'rubber': 1,
         'carefully': 29,
         'really': 106,
         'order': 271,
         'vida': 1,
         'pepper': 1,
         'ball': 7,
         'film': 19,
         'musical': 11,
         'wednesday': 23,
         'expensive': 22,
         'millennium': 8,
         'boy': 35,
         'schizophonic': 2,
         'winner': 25,
         'lot': 85,
         'cinema': 14,
         'margherita': 2,
         'gadget': 2,
         'title': 132,
         'sega': 1,
         'comp': 9,
         'lo': 19,
         'everyone': 74,
         'kit': 12,
         'mark': 66,
         'character': 21,
         'preorder': 7,
         'price': 139,
         'love': 55,
         'hits': 6,
         'jeni': 1,
         'girl': 36,
         'xxx': 38,
         'teen': 19,
         'trial': 33,
         'tempt': 4,
         'index': 56,
         'adult': 70,
         'tantalize': 3,
         'forbidden': 2,
         'html': 129,
         'mci': 17,
         'z': 13,
         'ones': 4,
         'shortest': 1,
         'range': 57,
         'kevin': 6,
         'cyberpromo': 3,
         'several': 109,
         'blank': 17,
         'familiar': 20,
         'numerous': 22,
         'duplicate': 31,
         'circle': 10,
         'finish': 24,
         'opportunity': 116,
         'extension': 23,
         'international': 141,
         'user': 53,
         'gigabyte': 1,
         'contain': 98,
         'newer': 2,
         'responsive': 7,
         'broad': 18,
         'possible': 131,
         'profanity': 9,
         'fund': 42,
         'bank': 81,
         'randomly': 4,
         'seal': 5,
         'offers': 6,
         'almost': 50,
         'off': 107,
         'ours': 17,
         'canada': 55,
         'cause': 28,
         'released': 15,
         'teaser': 1,
         'newsgroup': 15,
         'risk': 50,
         'cleanest': 16,
         'postings': 1,
         'vulgarity': 10,
         'cut': 43,
         'mine': 30,
         'highly': 37,
         'alberta': 2,
         'sort': 36,
         'fax': 247,
         'filter': 31,
         'produce': 63,
         'seeker': 3,
         'wrap': 23,
         'ups': 2,
         'dure': 13,
         'download': 43,
         'dupe': 12,
         'kick': 7,
         'undeliverable': 25,
         'sell': 108,
         'finally': 51,
         'fedex': 5,
         'unique': 42,
         'real': 90,
         'anon': 10,
         'nobody': 12,
         'fold': 8,
         'generate': 67,
         'private': 30,
         'nospam': 1,
         'mil': 15,
         'bonus': 49,
         'enclose': 43,
         'monrose': 1,
         'mlmer': 1,
         'type': 183,
         'nondeliverable': 1,
         'key': 40,
         'actually': 54,
         'unless': 27,
         'adam': 8,
         ...})

In [10]:
Common_Words = { w for w, _ in Word_Counter.most_common(N) }
Common_Words


Out[10]:
{'load',
 'comprise',
 'familiar',
 'teen',
 'massive',
 'gamble',
 'none',
 'implementation',
 'majority',
 'cgibin',
 'rejection',
 'smaller',
 'launch',
 'lee',
 'database',
 'food',
 'window',
 'transfer',
 'candidate',
 'delay',
 'frank',
 'multus',
 'late',
 'engage',
 'work',
 'government',
 'newest',
 'call',
 'vulgarity',
 'material',
 'organisation',
 'affect',
 'hundreds',
 'propose',
 'john',
 'campus',
 'competition',
 'view',
 'penny',
 'currency',
 'gender',
 'class',
 'santa',
 'andrew',
 'bonus',
 'refinance',
 'organization',
 'eric',
 'site',
 'ongo',
 'italy',
 'sprachwissenschaft',
 'operator',
 'little',
 'appear',
 'ms',
 'actually',
 'perform',
 'monthly',
 'opposite',
 'latex',
 'job',
 'forum',
 'correct',
 'install',
 'miss',
 'local',
 'remain',
 'chain',
 'music',
 'ready',
 'hundr',
 'dinner',
 'bill',
 'singapore',
 'option',
 'multimedium',
 'dialect',
 'translation',
 'most',
 'different',
 'literature',
 'unite',
 'sit',
 'sun',
 'desirous',
 'bear',
 'scientist',
 'income',
 'urge',
 'life',
 'extensive',
 'label',
 'city',
 'july',
 'mouse',
 'win',
 'continent',
 'die',
 'tom',
 'diploma',
 'edit',
 'fulfill',
 'sequence',
 'lucky',
 'less',
 'manufacturer',
 'implication',
 'colingacl',
 'global',
 'referral',
 'western',
 'using',
 'map',
 'great',
 'response',
 'wrong',
 'bind',
 'analyse',
 'busy',
 'pb',
 'commerce',
 'ext',
 'polish',
 'worldwide',
 'previously',
 'peter',
 'studies',
 'appropriate',
 'mailbox',
 'again',
 'partner',
 'truly',
 'catch',
 'develop',
 'meeting',
 'title',
 'cfp',
 'parttime',
 'begin',
 'dozen',
 'addition',
 'artificial',
 'experiment',
 'mci',
 'around',
 'alexis',
 'april',
 'dictionary',
 'receive',
 'internet',
 'exercise',
 'edinburgh',
 'oversea',
 'paper',
 'charge',
 'j',
 'store',
 'nice',
 'raleigh',
 'six',
 'style',
 'www',
 'member',
 'activity',
 'y',
 'onetime',
 'comment',
 'status',
 'means',
 'reap',
 'chat',
 'utility',
 'usage',
 'beautiful',
 'extractor',
 'hotmail',
 'phrase',
 'doctor',
 'generally',
 'highly',
 'helpful',
 'access',
 'red',
 'ac',
 'usa',
 'structural',
 'likewise',
 'major',
 'wherea',
 'acl',
 'canadian',
 'cds',
 'mastercard',
 'programme',
 'living',
 'ps',
 'goe',
 'design',
 'end',
 'foot',
 'postscript',
 'empirical',
 'color',
 'corner',
 'unsubscribe',
 'change',
 'deal',
 'substantial',
 'planet',
 'michigan',
 'dollars',
 'simon',
 'trip',
 'award',
 'credit',
 'distinguish',
 'quantifier',
 'registration',
 'grateful',
 'doe',
 'hesitate',
 'psycholinguistic',
 'robert',
 'publication',
 'typical',
 'demo',
 'log',
 'e',
 'interest',
 'm',
 'tremendous',
 'simple',
 'phonological',
 'excellent',
 'enjoy',
 'genie',
 'preview',
 'december',
 'impossible',
 'america',
 'webmaster',
 'dori',
 'le',
 'de',
 'please',
 'reality',
 'heart',
 'weeks',
 'parse',
 'meet',
 'russian',
 'dear',
 'netherland',
 'speak',
 'floor',
 'blvd',
 'entirely',
 'clearance',
 'practical',
 'contents',
 'medium',
 'lay',
 'somewhat',
 'surely',
 'accompany',
 'buy',
 'judgment',
 'orders',
 'fairchild',
 'zero',
 'focus',
 'spout',
 'wish',
 'translate',
 'vendor',
 'faith',
 'sincerely',
 'mixe',
 'draw',
 'typology',
 'joan',
 'participation',
 'quote',
 'integrate',
 'recently',
 'vocabulary',
 'interaction',
 'kit',
 'cycle',
 'session',
 'august',
 'sometime',
 'non',
 'volumes',
 'anderson',
 'anna',
 'discussion',
 'diskette',
 'finding',
 'entire',
 'fl',
 'tip',
 'player',
 'traditional',
 'lie',
 'opportunity',
 'ltd',
 'shop',
 'front',
 'reread',
 'alway',
 'condition',
 'band',
 'sex',
 'eye',
 'proper',
 'century',
 'avenue',
 'buck',
 'motivation',
 'postfach',
 'macintosh',
 'ignore',
 'inquiry',
 'vary',
 'md',
 'poor',
 'move',
 'top',
 'capture',
 'european',
 'harri',
 'israel',
 'modify',
 'eventually',
 'conversation',
 'assessment',
 'produce',
 'coverage',
 'media',
 'click',
 'indo',
 'ma',
 'female',
 'genuine',
 'typological',
 'interdisciplinary',
 'predicate',
 'banner',
 'provide',
 'back',
 'generate',
 'independent',
 'june',
 'monday',
 'individual',
 'speed',
 'dan',
 'thing',
 'demand',
 'wealth',
 'value',
 'programs',
 'texa',
 'nt',
 'national',
 'introduction',
 'amazing',
 'hr',
 'intelligent',
 'request',
 'surface',
 'classified',
 'policy',
 'mediumsize',
 'plain',
 'hit',
 'im',
 'commonly',
 'star',
 'guy',
 'symposium',
 'w',
 'september',
 'forever',
 'notify',
 'rest',
 'repeat',
 'martin',
 'description',
 'undoubtedly',
 'myself',
 'zip',
 'kong',
 'happen',
 'direct',
 'percentage',
 'grammar',
 'delivery',
 'reply',
 'du',
 'weekend',
 'although',
 'french',
 'rich',
 'http',
 'stay',
 'scheme',
 'conceptual',
 'deep',
 'subject',
 'abuse',
 'consult',
 'below',
 'fastest',
 'organiser',
 'comparable',
 'currently',
 'influence',
 'suppose',
 'htm',
 'enhance',
 'video',
 'jone',
 'document',
 'mb',
 'consideration',
 'apology',
 'iro',
 'michael',
 'result',
 'bid',
 'until',
 'excess',
 'put',
 'exclude',
 'hi',
 'unlimit',
 'bring',
 'explore',
 'trend',
 'try',
 'numbers',
 'expect',
 'organise',
 'ed',
 'history',
 'favorite',
 'due',
 'nc',
 'promptly',
 'yours',
 'san',
 'reports',
 'documentation',
 'initially',
 'near',
 'birth',
 'park',
 'trash',
 'firm',
 'confident',
 'virtually',
 'syntactic',
 'evergrow',
 'quit',
 'register',
 'general',
 'stun',
 'benefit',
 'quality',
 'spain',
 'teacher',
 'research',
 'pic',
 'participant',
 'evaluation',
 'responsible',
 'perceive',
 'institution',
 'bernard',
 'off',
 'cooperation',
 'marketing',
 'north',
 'illustrate',
 'hardcore',
 'yield',
 'newsgroup',
 'fantastic',
 'science',
 'beach',
 'innovative',
 'treat',
 'signal',
 'run',
 'exactly',
 'brand',
 'pennsylvanium',
 'wait',
 'contact',
 'minute',
 'totally',
 'statistics',
 'state',
 'amateur',
 'researcher',
 'clean',
 'cluster',
 'open',
 'reconstruction',
 'chri',
 'perfectly',
 'help',
 'completely',
 'operate',
 'loss',
 'watch',
 'approve',
 'someone',
 'arise',
 'scope',
 'development',
 'unless',
 'version',
 'necessarily',
 'edition',
 'dutch',
 'novel',
 'trial',
 'juno',
 'fast',
 'millions',
 'twenty',
 'dramatically',
 'anywhere',
 'original',
 'acceptance',
 'downsize',
 'today',
 'exact',
 'weekly',
 'keynote',
 'forget',
 'characteristic',
 'txt',
 'even',
 'increase',
 'clear',
 'advertiser',
 'purchase',
 'date',
 'integration',
 'conversational',
 'adult',
 'classic',
 'plan',
 'earth',
 'bottom',
 'associate',
 'sales',
 'south',
 'comprehensive',
 'making',
 'transcription',
 'easily',
 'finger',
 'la',
 'mortgage',
 'add',
 'conceal',
 'verbal',
 'underlie',
 'long',
 'import',
 'analysis',
 'tool',
 'application',
 'perhap',
 'snail',
 'review',
 'easiest',
 'extremely',
 'though',
 'verify',
 'leave',
 'virtual',
 'dynamic',
 'recipient',
 'couple',
 'least',
 'germany',
 'useful',
 'client',
 'television',
 'prof',
 'scott',
 'phd',
 'importance',
 'gb',
 'korean',
 'order',
 'variation',
 'aim',
 'trust',
 'thoma',
 'corporations',
 'musical',
 'much',
 'lifetime',
 'intrusion',
 'set',
 'iii',
 'potential',
 'concept',
 'country',
 'hotel',
 'school',
 'master',
 'acquire',
 'laugh',
 'mo',
 'psychological',
 'club',
 'obviously',
 'debt',
 'extract',
 'vision',
 'million',
 'acquisition',
 'object',
 'human',
 'sake',
 'include',
 't',
 'success',
 'theoretical',
 'community',
 'vacation',
 'sample',
 'wh',
 'five',
 'finance',
 'search',
 'datum',
 'engine',
 'while',
 'everybody',
 'february',
 'author',
 'editor',
 'summarize',
 'fundamental',
 'publish',
 'blackwell',
 'yourself',
 'hello',
 'investigation',
 'universal',
 'karen',
 'jump',
 'umontreal',
 'follow',
 'professional',
 'emerge',
 'once',
 'day',
 'jame',
 'slip',
 'medical',
 'borrow',
 'song',
 'idea',
 'hour',
 'argument',
 'obvious',
 'listing',
 'november',
 'subscription',
 'shift',
 'pleasure',
 'london',
 'however',
 'fun',
 'tree',
 'man',
 'classroom',
 'dates',
 'mt',
 'accommodation',
 'cm',
 'alternative',
 'send',
 'diversity',
 'framework',
 'bag',
 'cognition',
 'relationship',
 'reasonable',
 'att',
 'exceed',
 'von',
 'during',
 'region',
 'accessible',
 'linguistics',
 'refer',
 'observe',
 'britain',
 'creditor',
 'specify',
 'moneymake',
 'relevance',
 'honor',
 'connection',
 'news',
 'parameter',
 'twelve',
 'cv',
 'mike',
 'parallel',
 'apply',
 'is',
 'here',
 'short',
 'password',
 'mellon',
 'textbook',
 'telephone',
 'morn',
 'paragraph',
 'educational',
 'worth',
 'effective',
 'reflect',
 'normal',
 'compute',
 'ba',
 'pretty',
 'comparative',
 'contribute',
 'latest',
 'minimum',
 'interpretation',
 'australium',
 'drop',
 'tell',
 'fill',
 'richer',
 'compile',
 'residual',
 'convention',
 'belgium',
 'martha',
 'functional',
 'variety',
 'white',
 'spell',
 'interview',
 'isp',
 'cognitive',
 'slavic',
 'u',
 'possibility',
 'speech',
 'hire',
 'spokane',
 'japanese',
 'education',
 'unify',
 'distribute',
 'perception',
 'enquiry',
 'president',
 'poster',
 'african',
 'town',
 'overload',
 'everythe',
 'testimonial',
 'death',
 'christian',
 'purpose',
 'return',
 'team',
 'duration',
 'cinema',
 'lot',
 'seem',
 'linguistic',
 'freedom',
 'v',
 'want',
 'cheap',
 'total',
 'dept',
 'literary',
 'second',
 'honest',
 'wife',
 'head',
 'agency',
 'final',
 'assume',
 'berlin',
 'alone',
 'compete',
 'three',
 'attach',
 'montreal',
 'movie',
 'consist',
 're',
 'incredible',
 'lack',
 'mistake',
 'satisfy',
 'fine',
 'middle',
 'ask',
 'capability',
 'guest',
 'discover',
 'resort',
 'amount',
 'define',
 'certainly',
 'problem',
 'organize',
 'believe',
 'resell',
 'survey',
 'whatsoever',
 'lisa',
 'cent',
 'largest',
 'requirement',
 'property',
 'msn',
 'mail',
 'loan',
 'domain',
 'released',
 'pour',
 'sum',
 'complete',
 'syntax',
 'symbol',
 'command',
 'sheffield',
 'interpret',
 'reviewer',
 'merciless',
 'nijmegen',
 'dupe',
 'industrial',
 'phonology',
 'create',
 'pittsburgh',
 'radio',
 'together',
 'sender',
 'privacy',
 'nature',
 'length',
 'contributor',
 'forthcome',
 'bet',
 'genre',
 'product',
 'expiration',
 'indiana',
 'nl',
 'obligation',
 'newsletter',
 'ii',
 'emailer',
 'exclusive',
 'note',
 'maintain',
 'assure',
 'rock',
 'quick',
 'retail',
 'copy',
 'x',
 'ad',
 'light',
 'serve',
 'toy',
 'press',
 'file',
 'days',
 'gold',
 'strength',
 'few',
 'eastern',
 'both',
 'dependency',
 'chomsky',
 'course',
 'morpheme',
 'financially',
 'device',
 'behind',
 'situation',
 'parent',
 'foundation',
 'bit',
 'orient',
 'stealth',
 'remember',
 'vium',
 'personally',
 'difficult',
 'money',
 'jp',
 'institute',
 'disc',
 'lyric',
 'correspondence',
 'cancel',
 'probably',
 'asset',
 'prompt',
 'equipment',
 'feature',
 'id',
 'conclude',
 'need',
 'sale',
 'truth',
 'automatically',
 'mix',
 'spend',
 'sorry',
 'index',
 'art',
 'fact',
 'department',
 'common',
 'estate',
 'publisher',
 'basically',
 'conjunction',
 'ram',
 'arrange',
 'whether',
 'example',
 'sure',
 'plenary',
 'love',
 'preparation',
 'actual',
 'cameraready',
 'cost',
 'point',
 'experience',
 'quickly',
 'thousands',
 'ultimate',
 'browser',
 'started',
 'reg',
 'coordinate',
 'rush',
 'network',
 'indefinite',
 'delete',
 'mailer',
 'package',
 'instead',
 'side',
 'already',
 'essential',
 'ever',
 'responsibility',
 'mit',
 'brief',
 'desire',
 'discovery',
 'royal',
 'update',
 'intelligence',
 'control',
 'present',
 'round',
 'sponsor',
 'occur',
 'philosophy',
 'current',
 'web',
 'text',
 'broadcast',
 'spot',
 'effect',
 'tradition',
 'tv',
 'germanic',
 ...}

Computing the Conditional Probabilities

Having computed the most common words, we are now ready to compute the conditional probability that a given word occurs in a spam email.

The function $\texttt{get_common_words}(\texttt{fn})$ takes a filename $\texttt{fn}$ as its argument. It reads the file and returns the set of all words in Common_Words that are found in the given file.


In [11]:
def get_common_words(fn):
    return get_words(fn) & Common_Words

We test this function on a small email.


In [12]:
get_common_words('EmailData/ham-train/3-380msg4.txt')


Out[12]:
{'anyone',
 'article',
 'berkeley',
 'book',
 'consonant',
 'edu',
 'english',
 'hard',
 'helpful',
 'hi',
 'interest',
 'm',
 'modern',
 'phonetics',
 'project',
 'recommend',
 'source',
 'specifically',
 'thank',
 'too',
 'work'}
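The set intersection behind `get_common_words` can be illustrated on toy data. The following is a minimal sketch: the names `common_words` and `email_words` are hypothetical stand-ins for `Common_Words` and for the set returned by `get_words(fn)`; they are not part of the notebook.

```python
# Hypothetical stand-ins (assumptions, not the notebook's data):
# common_words plays the role of Common_Words,
# email_words plays the role of get_words(fn) for one email.
common_words = {'free', 'money', 'language', 'university'}
email_words  = {'free', 'money', 'now', 'offer'}

# The operator & computes the intersection of two sets: only words that
# occur in the email AND belong to the common words survive.
found = email_words & common_words
print(sorted(found))
```

Because the result is a set, a word that occurs several times in an email is still reported only once.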

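The function count_common_words takes a string specifying a directory. It returns a Counter that stores, for every word in Common_Words, the number of files in directory that contain this word. Since get_common_words returns a set, a word is counted at most once per file, no matter how often it occurs in that file.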


In [13]:
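def count_common_words(directory):
    # Each file contributes a set of words, hence every word is counted
    # at most once per email.
    Words = Counter()
    for file_name in os.listdir(directory):
        Words.update(get_common_words(os.path.join(directory, file_name)))
    return Words

# The misspelled name is kept as an alias because other cells call it.
count_commmon_words = count_common_words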
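To see why updating a `Counter` with sets yields the number of emails containing a word rather than the total number of occurrences, consider the following minimal sketch. The list `emails` and the name `doc_freq` are hypothetical; each set stands in for the result of `get_common_words` on one file.

```python
from collections import Counter

# Hypothetical data: each set represents the common words found in one email.
emails = [
    {'free', 'money'},
    {'free', 'offer'},
    {'money', 'free'},
]

doc_freq = Counter()
for words in emails:
    # update() adds 1 for every element of the iterable; since the iterable
    # is a set, each word is counted at most once per email.
    doc_freq.update(words)

print(doc_freq['free'], doc_freq['money'], doc_freq['offer'])
```

A word that is missing from the counter, such as `doc_freq['syntax']` here, yields 0 instead of raising a `KeyError`.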

Next, we compute two counters that store, for every common word, the number of spam emails and the number of ham emails in the training set that contain this word.


In [14]:
Spam_Counter = count_commmon_words(spam_dir_train)
Spam_Counter


Out[14]:
Counter({'earn': 51,
         'experience': 63,
         'through': 75,
         'phd': 6,
         'increase': 39,
         'grant': 12,
         'effort': 42,
         'choice': 23,
         'here': 146,
         'short': 38,
         'field': 33,
         'part': 50,
         'personal': 67,
         'programs': 15,
         'base': 42,
         'ba': 8,
         'phone': 93,
         'power': 30,
         'necessary': 25,
         'degree': 9,
         'further': 51,
         'detail': 55,
         'call': 132,
         'advance': 20,
         'require': 64,
         'award': 13,
         'present': 27,
         'knowledge': 30,
         'money': 140,
         'university': 15,
         'diploma': 7,
         'ma': 13,
         'cost': 99,
         'entire': 29,
         'conference': 6,
         'week': 104,
         'receive': 157,
         'start': 106,
         'our': 223,
         'delete': 39,
         'po': 27,
         'old': 40,
         'mailer': 15,
         'financial': 55,
         'member': 54,
         'problem': 47,
         'believe': 56,
         'ago': 29,
         'throw': 13,
         'customer': 52,
         'hello': 36,
         'letter': 67,
         'inexpensive': 16,
         'guarantee': 73,
         'ignore': 22,
         'complete': 54,
         'control': 30,
         'outside': 20,
         'cash': 69,
         'name': 133,
         'usa': 31,
         'state': 103,
         'texa': 5,
         'send': 154,
         'later': 35,
         'without': 66,
         'print': 56,
         'program': 99,
         'best': 123,
         'one': 168,
         'note': 59,
         'free': 198,
         'show': 76,
         'computer': 69,
         'credit': 71,
         'registration': 9,
         'must': 80,
         'process': 54,
         'center': 16,
         'today': 116,
         'weekly': 26,
         'mind': 33,
         'zip': 52,
         'interest': 98,
         'compound': 3,
         'few': 68,
         'address': 166,
         'simple': 72,
         'telephone': 31,
         'educational': 3,
         'main': 22,
         'worth': 32,
         'entitle': 4,
         'convert': 8,
         'plan': 41,
         's': 219,
         'message': 106,
         'join': 54,
         'number': 106,
         'respond': 24,
         'box': 56,
         'achieve': 21,
         'card': 70,
         'life': 65,
         'solution': 13,
         'mortgage': 17,
         'please': 188,
         'city': 69,
         'information': 153,
         'especially': 23,
         'net': 68,
         'id': 21,
         'participate': 31,
         'us': 156,
         'independence': 12,
         'tuesday': 2,
         'enable': 9,
         'company': 102,
         'over': 146,
         'simply': 79,
         'night': 18,
         'pm': 26,
         'intrusion': 15,
         'return': 65,
         'solid': 14,
         'establish': 15,
         'mean': 17,
         'freedom': 36,
         'form': 63,
         'begin': 32,
         'system': 65,
         'debt': 30,
         'obtain': 19,
         'secure': 25,
         'per': 79,
         'pack': 13,
         'vacation': 31,
         'west': 6,
         'pay': 92,
         'e': 87,
         'home': 101,
         'accomodation': 1,
         'wonderful': 10,
         'three': 25,
         'buck': 7,
         'room': 12,
         'mail': 179,
         'reserve': 16,
         'stay': 21,
         'season': 7,
         'announce': 13,
         'want': 145,
         'follow': 118,
         'space': 16,
         'com': 160,
         'compuserve': 17,
         'day': 154,
         'lunch': 8,
         'book': 33,
         'doe': 9,
         'additional': 51,
         'million': 85,
         'reach': 43,
         'commercial': 19,
         'info': 47,
         'future': 74,
         'success': 60,
         'internet': 124,
         'software': 67,
         'network': 22,
         'search': 60,
         'permanently': 8,
         'area': 43,
         'evaluation': 4,
         'proper': 7,
         'requirement': 13,
         'presence': 9,
         'section': 31,
         'stop': 38,
         'regard': 20,
         'propose': 9,
         'web': 93,
         'advantage': 45,
         'sender': 23,
         'certain': 16,
         'help': 89,
         'remove': 150,
         'target': 23,
         'product': 98,
         'fellow': 16,
         'promote': 24,
         'luck': 23,
         'basis': 18,
         'request': 80,
         'comply': 17,
         'recent': 7,
         'lead': 19,
         'mailing': 55,
         'bill': 54,
         'selection': 9,
         'c': 49,
         'reply': 89,
         'ten': 17,
         'paragraph': 9,
         'post': 30,
         'unite': 31,
         'transmission': 8,
         'gov': 20,
         'http': 157,
         'entrepreneur': 10,
         'subject': 102,
         'tool': 36,
         'service': 108,
         'dear': 37,
         'business': 114,
         'assist': 9,
         'level': 44,
         'need': 145,
         'sale': 57,
         'thoma': 5,
         'item': 11,
         'much': 102,
         'try': 75,
         'set': 43,
         'wish': 85,
         'thank': 89,
         'market': 102,
         'email': 185,
         'online': 85,
         'federal': 27,
         'audience': 7,
         'park': 16,
         'check': 126,
         'greatest': 27,
         're': 104,
         'onto': 6,
         'release': 30,
         'include': 129,
         'player': 10,
         'visit': 80,
         'ultimate': 10,
         'shop': 26,
         'while': 46,
         'chart': 8,
         'cd': 41,
         'never': 79,
         'package': 48,
         'alway': 57,
         'www': 110,
         'band': 8,
         'why': 65,
         'event': 15,
         'full': 65,
         'digital': 16,
         'right': 96,
         'delay': 9,
         'yourself': 72,
         'late': 13,
         'friend': 56,
         'easy': 89,
         'available': 93,
         'beautiful': 20,
         'chance': 43,
         'fantastic': 27,
         'top': 50,
         'pick': 32,
         'run': 48,
         'access': 45,
         'john': 14,
         'competition': 25,
         'click': 100,
         'offer': 143,
         'pop': 14,
         'n': 27,
         'roll': 18,
         'big': 46,
         'sound': 29,
         'radio': 17,
         'down': 67,
         'play': 29,
         'provide': 71,
         'london': 11,
         'thing': 57,
         'fun': 44,
         'site': 121,
         'record': 32,
         'prepare': 22,
         'nt': 127,
         'true': 44,
         'unsubscribe': 20,
         'b': 40,
         'technology': 29,
         'miss': 42,
         'exclusive': 23,
         'capitalfm': 17,
         'hit': 42,
         'thursday': 3,
         'save': 81,
         'straight': 6,
         'choose': 52,
         'question': 76,
         'rock': 7,
         'star': 16,
         'music': 14,
         'europe': 12,
         'hesitate': 25,
         'graphic': 12,
         'storm': 7,
         'instant': 15,
         'supply': 13,
         'special': 85,
         'll': 107,
         'regular': 16,
         'hurry': 8,
         'many': 117,
         'even': 99,
         'reveal': 12,
         'too': 46,
         'attend': 7,
         've': 81,
         'website': 39,
         'list': 166,
         'classic': 3,
         'absolutely': 39,
         'south': 15,
         'enter': 50,
         'latest': 36,
         'pc': 18,
         'prize': 16,
         'label': 8,
         'tell': 71,
         'fill': 55,
         'win': 87,
         'paradise': 7,
         'stock': 23,
         'thompson': 5,
         'discount': 18,
         'couple': 21,
         'guess': 8,
         'u': 48,
         'entirely': 4,
         'amaze': 37,
         'link': 53,
         'advertisement': 42,
         'better': 57,
         'william': 6,
         'feel': 33,
         'become': 42,
         'manufacturer': 9,
         'album': 12,
         'game': 28,
         'still': 52,
         'buy': 79,
         'primary': 8,
         'bring': 35,
         'screen': 12,
         'president': 7,
         'biz': 9,
         'surround': 5,
         'poster': 1,
         'everythe': 15,
         'fm': 10,
         'focus': 5,
         'talk': 29,
         'team': 20,
         'mailbox': 24,
         'next': 69,
         'catch': 16,
         'favourite': 9,
         'world': 80,
         'laugh': 12,
         'whether': 20,
         'performance': 8,
         'hot': 27,
         'head': 12,
         'capital': 43,
         'movie': 19,
         'major': 57,
         'submission': 11,
         'automatically': 31,
         'report': 70,
         'notice': 7,
         'engine': 34,
         'advertiser': 14,
         'within': 89,
         'bulk': 58,
         'after': 80,
         'each': 96,
         'etc': 53,
         'every': 114,
         'appropriate': 5,
         'page': 57,
         'toll': 36,
         'monthly': 28,
         'pro': 10,
         'hr': 15,
         'extractor': 11,
         'block': 17,
         'month': 94,
         'review': 23,
         'submit': 19,
         'media': 3,
         'tag': 3,
         'thousands': 10,
         'solve': 4,
         'something': 32,
         'spam': 26,
         'reg': 11,
         'dollar': 73,
         'powerful': 27,
         'quickly': 29,
         'community': 7,
         't': 73,
         'high': 47,
         'literally': 9,
         'general': 13,
         'along': 30,
         'travel': 31,
         'ask': 67,
         'benefit': 27,
         'oversea': 7,
         'paper': 34,
         'finance': 14,
         'promise': 23,
         'legally': 10,
         'amount': 61,
         'clearly': 7,
         'confirm': 9,
         'certainly': 11,
         'upon': 23,
         'contract': 10,
         'word': 38,
         'extra': 42,
         'thousand': 71,
         'means': 18,
         'work': 123,
         'soon': 50,
         'monitor': 8,
         'before': 89,
         'themselve': 21,
         'vary': 8,
         'march': 10,
         'move': 38,
         'ca': 43,
         'under': 55,
         'exactly': 38,
         'kid': 14,
         'public': 21,
         'view': 17,
         'greatly': 9,
         'earlier': 4,
         'contact': 67,
         'likewise': 4,
         'currency': 10,
         'minute': 41,
         'wall': 13,
         'create': 56,
         'daily': 22,
         'yet': 20,
         'reason': 42,
         'effect': 5,
         'editorial': 3,
         'santa': 4,
         'optional': 13,
         'back': 85,
         'completely': 36,
         'end': 47,
         'individual': 25,
         'operate': 22,
         'organization': 23,
         'however': 27,
         'watch': 36,
         'someone': 48,
         'rate': 61,
         'wealth': 10,
         'fortune': 22,
         'own': 96,
         'political': 3,
         'membership': 19,
         'corner': 8,
         'national': 13,
         'change': 59,
         'payable': 35,
         'dollars': 19,
         'write': 55,
         'o': 42,
         'overnight': 25,
         'let': 65,
         'group': 34,
         'first': 110,
         'assure': 9,
         'profile': 10,
         'same': 69,
         'attention': 22,
         'publication': 12,
         'continue': 33,
         'postage': 18,
         'else': 51,
         'gold': 20,
         'instruction': 64,
         'nor': 23,
         'm': 78,
         'fee': 36,
         'most': 112,
         'date': 53,
         'different': 62,
         'announcement': 6,
         'concern': 12,
         'glad': 7,
         'unlike': 10,
         'earth': 21,
         'able': 46,
         'parent': 8,
         'easily': 40,
         'anyone': 61,
         'add': 79,
         'york': 18,
         'depend': 15,
         'long': 39,
         'below': 99,
         'allow': 60,
         'action': 37,
         'street': 37,
         'operation': 10,
         'exist': 23,
         'legal': 42,
         'advice': 6,
         'extremely': 25,
         'leave': 57,
         'cancel': 8,
         'important': 45,
         'californium': 9,
         'refund': 29,
         'american': 35,
         'document': 11,
         'confidential': 13,
         'hand': 45,
         'read': 85,
         'conclude': 12,
         'keep': 80,
         'grow': 22,
         'until': 44,
         'surely': 16,
         'hi': 24,
         'secret': 38,
         'global': 17,
         'unlimit': 20,
         'profit': 48,
         'enquiry': 1,
         'don': 42,
         'great': 81,
         'line': 93,
         'learn': 53,
         'ship': 42,
         'immediately': 51,
         'those': 77,
         'instruct': 15,
         'limited': 17,
         'ourselve': 9,
         'worldwide': 26,
         'purpose': 13,
         'source': 21,
         'plus': 60,
         'again': 78,
         'office': 48,
         'school': 16,
         'low': 33,
         'hundred': 48,
         'total': 47,
         'd': 54,
         'recently': 21,
         'second': 23,
         'suite': 36,
         'exchange': 19,
         'share': 42,
         'method': 39,
         'extract': 15,
         'around': 33,
         'tip': 19,
         'listen': 13,
         'excite': 42,
         'teacher': 1,
         'everybody': 7,
         'beer': 5,
         'answer': 50,
         'century': 10,
         'ever': 78,
         'love': 38,
         'chat': 16,
         'universal': 3,
         'channel': 9,
         'globe': 5,
         'zone': 8,
         'hottest': 18,
         'uk': 21,
         'red': 8,
         'song': 6,
         'wait': 51,
         'past': 42,
         'tv': 18,
         'compzone': 7,
         'june': 10,
         'man': 17,
         'forthcome': 4,
         'rd': 20,
         'piece': 32,
         'break': 45,
         'bag': 7,
         'stress': 6,
         'fabulous': 17,
         'live': 69,
         'highlight': 6,
         'ad': 49,
         'martin': 3,
         'angele': 7,
         'beverage': 6,
         'st': 38,
         'saturday': 13,
         'co': 18,
         'delivery': 33,
         'summer': 8,
         'both': 45,
         'weekend': 12,
         'th': 45,
         'where': 90,
         'la': 8,
         'professor': 3,
         'holiday': 17,
         'meet': 31,
         'lyric': 5,
         'video': 39,
         'size': 22,
         'nd': 26,
         'bargain': 12,
         'vote': 9,
         'prof': 2,
         'ticket': 24,
         'feature': 22,
         'prior': 23,
         'carefully': 20,
         'really': 65,
         'order': 130,
         'film': 10,
         'musical': 5,
         'wednesday': 4,
         'expensive': 17,
         'boy': 24,
         'winner': 16,
         'cinema': 9,
         'lot': 52,
         'title': 33,
         'lo': 7,
         'everyone': 48,
         'kit': 10,
         'mark': 14,
         'character': 3,
         'price': 84,
         'preparation': 4,
         'girl': 20,
         'xxx': 21,
         'teen': 15,
         'trial': 23,
         'index': 19,
         'adult': 42,
         'html': 42,
         'mci': 12,
         'z': 5,
         'range': 14,
         'familiar': 7,
         'several': 49,
         'blank': 12,
         'numerous': 10,
         'duplicate': 22,
         'circle': 6,
         'finish': 12,
         'opportunity': 73,
         'extension': 6,
         'international': 38,
         'user': 22,
         'contain': 43,
         'possible': 40,
         'broad': 1,
         'bank': 51,
         'fund': 18,
         'almost': 32,
         'off': 70,
         'ours': 12,
         'canada': 12,
         'cause': 10,
         'released': 10,
         'risk': 36,
         'newsgroup': 10,
         'cleanest': 11,
         'vulgarity': 7,
         'cut': 26,
         'mine': 13,
         'highly': 19,
         'sort': 17,
         'fax': 51,
         'filter': 19,
         'produce': 33,
         'wrap': 15,
         'dure': 3,
         'download': 30,
         'dupe': 8,
         'undeliverable': 18,
         'sell': 84,
         'finally': 34,
         'unique': 27,
         'real': 52,
         'anon': 6,
         'nobody': 9,
         'private': 23,
         'generate': 43,
         'mil': 9,
         'bonus': 41,
         'enclose': 28,
         'type': 84,
         'key': 20,
         'actually': 24,
         'unless': 18,
         'fast': 27,
         'place': 81,
         'yes': 30,
         'remain': 11,
         'valid': 14,
         'close': 22,
         'specifically': 6,
         'since': 37,
         'w': 18,
         'file': 55,
         'huge': 37,
         'is': 83,
         'tremendous': 11,
         'small': 41,
         'password': 8,
         'purchase': 73,
         'are': 56,
         'against': 33,
         'anything': 43,
         'course': 40,
         'edu': 9,
         'average': 18,
         'directory': 23,
         'eliminate': 25,
         'replace': 16,
         'super': 24,
         'production': 10,
         'bottom': 23,
         'clock': 7,
         'server': 27,
         'account': 46,
         'rich': 24,
         'gather': 7,
         'webmaster': 12,
         'marketer': 14,
         'envelope': 27,
         'postmaster': 6,
         'abuse': 10,
         'stealth': 22,
         'whole': 24,
         'inside': 15,
         'ensure': 7,
         'org': 14,
         'vium': 45,
         'faster': 23,
         'removal': 9,
         'investment': 36,
         'longer': 23,
         'classify': 10,
         'cdrom': 12,
         'pure': 12,
         'isp': 17,
         'road': 25,
         'less': 57,
         'client': 18,
         'result': 56,
         'bid': 18,
         'excess': 18,
         'put': 77,
         'reduce': 25,
         'fresh': 42,
         'otherwise': 14,
         'using': 27,
         'response': 44,
         'combine': 14,
         'fact': 43,
         'tout': 6,
         'addresses': 19,
         'numbers': 10,
         'collect': 24,
         'country': 41,
         'due': 26,
         'seem': 18,
         'flame': 10,
         'prodigy': 7,
         'sign': 41,
         'dozen': 5,
         'test': 35,
         'example': 35,
         'near': 15,
         'sure': 65,
         'lists': 20,
         'consist': 3,
         'actual': 9,
         'diskette': 8,
         'fine': 10,
         'act': 21,
         'doubt': 27,
         'magazine': 18,
         'window': 33,
         'compress': 7,
         'stimulate': 4,
         'activity': 12,
         'whatsoever': 13,
         'comment': 11,
         'position': 35,
         'multiple': 16,
         'macintosh': 5,
         'utility': 12,
         'everything': 50,
         'meg': 7,
         'treat': 22,
         'intelligence': 15,
         'command': 6,
         'once': 68,
         'conversation': 6,
         'compatible': 6,
         'disk': 13,
         'girlfriend': 4,
         'design': 34,
         'rom': 11,
         'above': 62,
         'mac': 9,
         'differently': 4,
         'woman': 15,
         'king': 6,
         'protection': 13,
         'install': 9,
         'celebrity': 7,
         'correct': 14,
         'copy': 64,
         'guy': 15,
         'code': 58,
         'personality': 4,
         'x': 46,
         'either': 33,
         'toy': 9,
         'existence': 6,
         'voice': 14,
         'likes': 6,
         'hear': 42,
         'hard': 44,
         'unmark': 6,
         'ibm': 7,
         'boyfriend': 5,
         'sexual': 12,
         'reality': 8,
         'turn': 41,
         'model': 8,
         'remember': 45,
         'deat': 28,
         'higher': 15,
         'continent': 5,
         'interactive': 11,
         'realistic': 9,
         'guide': 29,
         'drive': 23,
         'relate': 17,
         'virtual': 10,
         'blvd': 12,
         'least': 44,
         'upset': 7,
         'obey': 4,
         'beg': 5,
         'sexually': 8,
         'attitude': 5,
         'ram': 7,
         'inform': 13,
         'partner': 27,
         'v': 28,
         'blast': 7,
         'club': 18,
         'artificial': 6,
         'clothe': 10,
         'imagine': 33,
         'porn': 8,
         'handle': 26,
         'sex': 22,
         'story': 20,
         'picture': 16,
         'birth': 6,
         'none': 6,
         'rejection': 1,
         'charge': 48,
         'responsible': 10,
         'north': 7,
         'qualify': 23,
         'law': 33,
         'perform': 11,
         'job': 48,
         'annual': 12,
         'conduct': 4,
         'creditor': 15,
         'bankruptcy': 21,
         'regardless': 10,
         'match': 9,
         'apply': 18,
         'bad': 12,
         'excellent': 24,
         'income': 69,
         'payment': 38,
         'express': 28,
         'application': 15,
         'seek': 13,
         'security': 41,
         'made': 9,
         'nj': 8,
         'student': 13,
         'prompt': 16,
         'deposit': 21,
         'resource': 22,
         'history': 13,
         'guaranteed': 21,
         'signature': 29,
         'savings': 12,
         'final': 13,
         'datum': 9,
         'recieve': 11,
         'text': 31,
         'clean': 19,
         'open': 45,
         'together': 20,
         'cheque': 5,
         'value': 30,
         'unsolicit': 14,
         'england': 6,
         'clear': 15,
         'direct': 36,
         'minimum': 10,
         'import': 8,
         'disc': 5,
         'recipient': 13,
         'fully': 22,
         'quote': 10,
         'pound': 7,
         'normally': 8,
         'resident': 11,
         'virtually': 14,
         'collection': 15,
         'select': 48,
         'resell': 22,
         'cent': 19,
         'msn': 9,
         'marketing': 29,
         'ability': 23,
         'management': 12,
         'compare': 11,
         'class': 26,
         'hour': 93,
         'mastercard': 35,
         'nothing': 48,
         'copyright': 17,
         'speed': 20,
         'accept': 50,
         'tree': 7,
         'mass': 16,
         'expiration': 21,
         'deal': 34,
         'visa': 40,
         'anywhere': 49,
         'aol': 39,
         'ready': 40,
         'dream': 46,
         'reward': 10,
         'smith': 4,
         'sales': 22,
         'person': 47,
         'function': 6,
         'step': 55,
         'setup': 9,
         'currently': 24,
         'hours': 14,
         'stepby': 15,
         'tax': 28,
         'touch': 10,
         'thesis': 2,
         'kind': 25,
         'yours': 54,
         'provider': 17,
         'rights': 22,
         'volume': 15,
         'trash': 13,
         'satisfy': 19,
         'period': 23,
         'thereafter': 9,
         'sample': 18,
         'separate': 11,
         'quality': 25,
         'services': 15,
         ...})
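Counts of this kind can be turned into estimates of the conditional probability that a word occurs in a spam email by dividing by the number of spam training emails (350 according to the description of the data set). The sketch below is only an illustration under stated assumptions: the helper `word_probability`, its parameter `alpha`, and the add-`alpha` smoothing are hypothetical choices and need not be the formula this notebook uses; `spam_counts` is a three-word excerpt of the output above.

```python
from collections import Counter

# Excerpt of the counts printed above; a stand-in for Spam_Counter.
spam_counts = Counter({'free': 198, 'money': 140, 'phd': 6})
n_spam = 350  # number of spam training emails

def word_probability(counts, word, n_emails, alpha=1.0):
    """Estimate P(word occurs | class) from document counts.

    The add-alpha smoothing is an assumption made for this sketch: it keeps
    the estimate away from 0 and 1 for words that occur in none or in all
    of the emails. With alpha=0 the plain relative frequency is returned.
    """
    return (counts[word] + alpha) / (n_emails + 2 * alpha)

for w in ['free', 'money', 'phd', 'syntax']:
    print(w, round(word_probability(spam_counts, w, n_spam), 4))
```

Without smoothing, a word that never occurs in spam would receive probability 0, which causes trouble as soon as logarithms of the probabilities are taken.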

In [15]:
Ham__Counter = count_commmon_words(ham__dir_train)
Ham__Counter


Out[15]:
Counter({'range': 29,
         'comprise': 4,
         'through': 33,
         'future': 20,
         'lab': 9,
         'practice': 11,
         'coordinate': 7,
         'language': 241,
         'international': 76,
         'research': 116,
         'promise': 5,
         'area': 72,
         'broad': 10,
         'www': 116,
         'fund': 12,
         'identify': 30,
         'pari': 15,
         'canada': 28,
         'work': 99,
         'sunday': 9,
         'call': 119,
         'umontreal': 7,
         'follow': 130,
         'assess': 9,
         'therefore': 16,
         'syntax': 65,
         'israel': 8,
         'modify': 8,
         'present': 79,
         'ca': 38,
         'outside': 10,
         'tag': 7,
         'view': 33,
         'usa': 53,
         'current': 32,
         'state': 57,
         'researcher': 40,
         'face': 13,
         'together': 43,
         'programme': 35,
         'morphology': 36,
         'provide': 86,
         'html': 56,
         'examine': 16,
         'individual': 27,
         'accept': 47,
         'arabic': 10,
         'own': 32,
         'target': 8,
         'mt': 7,
         'pre': 9,
         'little': 13,
         'computational': 38,
         'national': 30,
         'forum': 27,
         'coordinator': 9,
         'specifically': 8,
         'bell': 9,
         'europe': 20,
         'registration': 52,
         'bar': 4,
         'france': 21,
         'either': 40,
         'description': 39,
         'c': 79,
         'mike': 7,
         'direct': 27,
         'committee': 50,
         'short': 25,
         'consequence': 15,
         'workshop': 71,
         'hebrew': 8,
         'date': 35,
         'concern': 36,
         'theme': 27,
         'although': 30,
         'edu': 105,
         'centre': 29,
         'xerox': 12,
         'http': 137,
         'support': 37,
         'subject': 44,
         'generation': 19,
         'where': 57,
         'exist': 25,
         'university': 201,
         'parse': 14,
         'papers': 99,
         'possibility': 18,
         'iro': 7,
         'michael': 32,
         'result': 41,
         'colingacl': 7,
         'aim': 41,
         'approach': 71,
         'art': 24,
         'much': 37,
         'each': 53,
         'common': 23,
         'george': 15,
         'potential': 16,
         'collect': 8,
         'develop': 32,
         'body': 8,
         'august': 42,
         'session': 47,
         'final': 33,
         'montreal': 14,
         'challenge': 17,
         'contact': 89,
         'process': 69,
         're': 52,
         'homepage': 13,
         'susan': 14,
         'text': 77,
         'web': 65,
         'robert': 28,
         'benjamin': 19,
         'connection': 11,
         'speech': 57,
         'read': 32,
         'visit': 28,
         'editorial': 8,
         'william': 18,
         'chri': 2,
         'order': 73,
         'h': 34,
         'm': 79,
         'l': 51,
         'j': 44,
         'et': 17,
         'grammar': 52,
         'relation': 26,
         'site': 47,
         'development': 56,
         'bank': 10,
         'pattern': 22,
         'resource': 24,
         'christian': 8,
         'word': 114,
         'ed': 31,
         'g': 61,
         'nl': 42,
         'function': 25,
         'locate': 9,
         'linguistic': 170,
         'mean': 43,
         'semantics': 53,
         'english': 125,
         'life': 12,
         'le': 21,
         'sign': 17,
         'de': 87,
         'social': 30,
         'paul': 28,
         'please': 133,
         'verbal': 9,
         'total': 9,
         'note': 47,
         'harri': 9,
         'lexical': 44,
         'elizabeth': 9,
         'matter': 14,
         'verb': 39,
         'john': 60,
         'linguistics': 103,
         'theory': 71,
         'natural': 52,
         'philosophy': 14,
         'ac': 54,
         'information': 174,
         'k': 35,
         'thompson': 4,
         'dynamic': 12,
         'industry': 3,
         'million': 5,
         'experience': 33,
         'include': 130,
         'conference': 99,
         'expense': 10,
         'implementation': 10,
         'effort': 15,
         'benefit': 4,
         'software': 23,
         'database': 15,
         'window': 3,
         'strong': 9,
         'theart': 3,
         'candidate': 11,
         'position': 43,
         'fax': 130,
         'science': 61,
         'phonetics': 18,
         'complete': 30,
         'signal': 9,
         'prefer': 19,
         'n': 28,
         'phonology': 45,
         'prosodic': 11,
         'jean': 13,
         'advantage': 8,
         'two': 79,
         'design': 23,
         'length': 27,
         'enclose': 5,
         'between': 79,
         'mac': 6,
         'send': 109,
         'statistical': 18,
         'break': 24,
         'substantial': 12,
         'job': 21,
         'inc': 18,
         'skill': 10,
         'scientific': 17,
         'house': 15,
         'knowledge': 23,
         'engineer': 15,
         'computer': 44,
         'salary': 7,
         'x': 43,
         'graphic': 8,
         'center': 31,
         'acoustic': 6,
         'singapore': 10,
         'publication': 53,
         'tel': 61,
         'e': 131,
         'successful': 19,
         'apply': 50,
         'mr': 11,
         'both': 83,
         'personal': 11,
         'telephone': 33,
         'post': 49,
         's': 189,
         'join': 13,
         'project': 33,
         'sun': 8,
         'number': 80,
         'scientist': 10,
         'desirable': 5,
         'model': 45,
         'analysis': 70,
         'tool': 22,
         'institute': 52,
         'relevant': 24,
         'californium': 32,
         'technical': 24,
         'least': 27,
         'less': 13,
         'phd': 9,
         'us': 72,
         'need': 54,
         'encourage': 23,
         'preferably': 20,
         'degree': 18,
         'stateof': 3,
         'email': 136,
         'require': 34,
         'interaction': 31,
         'system': 59,
         'chinese': 16,
         'content': 33,
         'end': 31,
         'official': 11,
         'fuer': 12,
         'later': 25,
         'application': 52,
         'yet': 15,
         'begin': 21,
         'mid': 2,
         'inform': 8,
         'period': 10,
         'six': 13,
         'sincerely': 4,
         'keep': 8,
         'january': 26,
         'week': 22,
         'sprachwissenschaft': 9,
         'expect': 18,
         'r': 40,
         'student': 65,
         'cognitive': 43,
         'issue': 77,
         'october': 26,
         'oxford': 13,
         'press': 21,
         'upto': 5,
         'paper': 100,
         'f': 30,
         'most': 52,
         'pp': 38,
         'learn': 34,
         'key': 13,
         'study': 85,
         'wide': 33,
         'history': 23,
         'concept': 20,
         'introduction': 24,
         'brief': 20,
         'title': 61,
         'overview': 12,
         'org': 18,
         'second': 60,
         'accessible': 11,
         'cloth': 16,
         'first': 100,
         'cover': 32,
         'book': 79,
         'third': 19,
         'act': 13,
         'secretary': 7,
         'theoretical': 41,
         'patrick': 10,
         'general': 61,
         'ask': 34,
         'p': 68,
         'po': 17,
         'representation': 35,
         'package': 4,
         'author': 72,
         'dialogue': 18,
         'publish': 57,
         'page': 89,
         'available': 85,
         'preliminary': 8,
         'jame': 18,
         'interface': 23,
         'jan': 18,
         'formal': 30,
         'prove': 6,
         'november': 19,
         'inference': 11,
         'postscript': 20,
         'dates': 13,
         'prepare': 14,
         'version': 24,
         'further': 63,
         'b': 44,
         'latex': 11,
         'notification': 34,
         'place': 54,
         'o': 39,
         'limit': 44,
         'van': 26,
         'steve': 7,
         'original': 28,
         'tilburg': 11,
         'acceptance': 32,
         'proceedings': 20,
         'host': 15,
         'september': 31,
         'involve': 28,
         'selection': 15,
         'aspect': 54,
         'interest': 112,
         'invite': 74,
         'chair': 29,
         'initial': 18,
         'box': 39,
         'room': 21,
         'interpretation': 23,
         'professor': 25,
         'htm': 11,
         'martha': 7,
         'netherland': 30,
         'important': 44,
         'submission': 64,
         'bring': 42,
         'index': 22,
         'topics': 10,
         'semantic': 51,
         'phone': 54,
         'department': 74,
         'focus': 55,
         'context': 40,
         'due': 35,
         'office': 19,
         'anne': 17,
         'guideline': 13,
         'form': 83,
         'mark': 36,
         'topic': 74,
         'faculty': 20,
         'submit': 63,
         'preparation': 6,
         'technique': 16,
         'lead': 31,
         'discussion': 81,
         'cluster': 9,
         'parameter': 10,
         'real': 12,
         'recognition': 23,
         'linguist': 71,
         'principle': 29,
         'help': 41,
         'background': 19,
         'andrew': 14,
         'decision': 12,
         'algorithm': 9,
         'datum': 47,
         'our': 44,
         'enable': 8,
         'hide': 3,
         'tree': 3,
         'statement': 21,
         'affiliation': 47,
         'clearly': 19,
         'select': 30,
         'editor': 32,
         'suitable': 11,
         'reflect': 20,
         'list': 75,
         'distribution': 14,
         'message': 36,
         'criterion': 13,
         'set': 38,
         'goal': 14,
         'goodness': 2,
         'mit': 24,
         'series': 19,
         'foundation': 14,
         'valuable': 6,
         'request': 32,
         'communication': 46,
         'maximum': 17,
         'below': 52,
         'underlie': 12,
         'isbn': 18,
         'method': 34,
         'show': 44,
         'review': 44,
         'reader': 15,
         'decade': 8,
         'reviewer': 9,
         'document': 21,
         'abstracts': 9,
         'choice': 10,
         'website': 19,
         'address': 113,
         'field': 59,
         'style': 27,
         'educational': 9,
         'organizer': 26,
         'discourse': 51,
         'market': 15,
         'december': 27,
         'methodology': 17,
         'contribution': 24,
         'structure': 59,
         'case': 57,
         'variable': 7,
         'psychology': 18,
         'documentation': 5,
         'affect': 8,
         'informative': 4,
         'corpus': 29,
         'audience': 8,
         'deadline': 55,
         'april': 42,
         'italian': 15,
         'ignore': 12,
         'interpret': 14,
         'brian': 11,
         'texa': 19,
         'operator': 3,
         'category': 23,
         'ii': 27,
         'url': 23,
         'china': 6,
         'must': 58,
         'austin': 10,
         'cv': 12,
         'translation': 35,
         'directory': 6,
         'electronic': 29,
         'amsterdam': 20,
         'long': 24,
         'il': 14,
         'meet': 40,
         'u': 41,
         'david': 34,
         'un': 11,
         'simply': 16,
         'iii': 15,
         'translate': 9,
         'world': 48,
         'school': 39,
         'v': 45,
         'd': 94,
         'approximately': 13,
         'price': 23,
         'ad': 6,
         'idea': 24,
         'recent': 37,
         'thanks': 7,
         'commercial': 9,
         'equipment': 8,
         'file': 21,
         'even': 36,
         'middle': 8,
         'along': 22,
         'walk': 6,
         'opportunity': 15,
         'someone': 12,
         'eastern': 11,
         'expression': 17,
         'man': 13,
         'wonder': 9,
         'different': 56,
         'old': 17,
         'confirm': 9,
         'instead': 8,
         'glad': 3,
         'those': 58,
         'french': 38,
         'talk': 34,
         'mass': 7,
         'sit': 1,
         'foreign': 14,
         'thank': 38,
         'print': 22,
         'ibm': 4,
         'lose': 10,
         'country': 26,
         'discuss': 38,
         'respond': 8,
         'mary': 8,
         'write': 68,
         'anyone': 37,
         'want': 25,
         'one': 125,
         'input': 8,
         'service': 18,
         'question': 92,
         'assume': 19,
         'though': 18,
         'grateful': 7,
         'early': 21,
         'three': 48,
         'speak': 44,
         'attention': 26,
         'probably': 10,
         'bite': 5,
         'demonstrate': 12,
         'useful': 14,
         'net': 7,
         'option': 7,
         'feature': 42,
         'keyword': 9,
         'sale': 2,
         'else': 15,
         'still': 21,
         'search': 6,
         'engine': 6,
         'teacher': 20,
         'query': 25,
         'conclusion': 13,
         'spanish': 20,
         'similar': 16,
         'response': 18,
         'quite': 21,
         'try': 16,
         'au': 13,
         'etc': 57,
         'fact': 26,
         'nt': 34,
         'excellent': 4,
         'digital': 5,
         'colleague': 16,
         'true': 13,
         'polish': 10,
         'return': 14,
         'puzzle': 7,
         'seem': 33,
         'next': 15,
         'waste': 2,
         'door': 1,
         'decide': 8,
         'again': 15,
         'favourite': 1,
         'indeed': 17,
         'yes': 5,
         'tell': 22,
         'mine': 10,
         'fail': 6,
         'turn': 12,
         'com': 33,
         'build': 27,
         'surprise': 9,
         'perhap': 22,
         'returns': 2,
         'large': 20,
         'head': 24,
         'experiment': 10,
         'extremely': 8,
         'offer': 34,
         'relate': 57,
         'notion': 21,
         'far': 19,
         'guy': 7,
         'size': 10,
         'volume': 42,
         'vol': 14,
         'syntactic': 30,
         'll': 6,
         'june': 31,
         'edinburgh': 10,
         'quality': 18,
         'assistant': 9,
         'vowel': 12,
         'dutch': 17,
         'king': 10,
         'stress': 6,
         'uk': 46,
         'ling': 20,
         'additional': 27,
         'point': 44,
         'intend': 27,
         'receive': 53,
         'register': 24,
         'home': 24,
         'separate': 18,
         'ascii': 13,
         'february': 27,
         'inch': 5,
         'anybody': 11,
         'inquiry': 13,
         'march': 38,
         'name': 84,
         'minute': 37,
         'compare': 14,
         'speaker': 75,
         'type': 56,
         'margin': 8,
         'charle': 16,
         'announce': 37,
         'during': 19,
         'copy': 64,
         'perspective': 41,
         'seminar': 10,
         'universitaet': 13,
         'st': 28,
         'anonymous': 18,
         'announcement': 26,
         'th': 67,
         'abstract': 81,
         'card': 17,
         'vium': 41,
         'format': 32,
         'pure': 3,
         'correspondence': 13,
         'nd': 17,
         'reference': 72,
         'slavic': 10,
         'germany': 45,
         'negation': 9,
         'participate': 16,
         'speakers': 10,
         'accompany': 10,
         'maria': 14,
         'acceptable': 7,
         'publisher': 14,
         'arrange': 5,
         'standard': 30,
         'detail': 47,
         'participation': 28,
         'famous': 7,
         'several': 32,
         'lie': 7,
         'rejection': 8,
         'onepage': 13,
         'teach': 35,
         'late': 12,
         'before': 49,
         'day': 30,
         'run': 11,
         'under': 28,
         'campus': 15,
         'comparison': 18,
         'compatible': 2,
         'structural': 15,
         'major': 35,
         'cultural': 13,
         'santa': 8,
         'west': 12,
         'nature': 22,
         'organization': 15,
         'scope': 15,
         'part': 48,
         'foot': 3,
         'understand': 37,
         'society': 34,
         'cognition': 13,
         'basis': 28,
         'relationship': 17,
         'program': 67,
         'mexico': 14,
         'notify': 14,
         'special': 37,
         'characteristic': 9,
         'lecture': 17,
         'hardcopy': 13,
         'emphasis': 6,
         'summer': 16,
         'within': 45,
         'course': 41,
         'crosslinguistic': 8,
         'plan': 22,
         'enjoy': 5,
         'america': 25,
         'conceptual': 11,
         'july': 32,
         'city': 19,
         'functional': 25,
         'realistic': 2,
         'american': 29,
         'consideration': 17,
         'over': 32,
         'four': 22,
         'night': 9,
         'pragmatic': 39,
         'native': 24,
         'joan': 9,
         'psychological': 9,
         'addition': 24,
         'direction': 12,
         'share': 24,
         'der': 14,
         'germanic': 9,
         'modal': 7,
         'romance': 9,
         'belgium': 7,
         'around': 16,
         'fine': 6,
         'traditional': 16,
         'previous': 12,
         'pay': 21,
         'attract': 8,
         'possible': 58,
         'problem': 54,
         'organize': 41,
         'property': 17,
         'soon': 16,
         'sum': 11,
         'european': 39,
         'highly': 8,
         'propose': 27,
         'interdisciplinary': 14,
         'avoid': 8,
         'italy': 20,
         'logic': 17,
         'framework': 25,
         'utrecht': 9,
         'ph': 23,
         'elsewhere': 12,
         'let': 22,
         'serve': 12,
         'thus': 24,
         'attend': 14,
         'logical': 12,
         'fee': 32,
         'term': 38,
         'main': 31,
         'entitle': 5,
         'introductory': 8,
         'solution': 11,
         'advance': 43,
         'heart': 4,
         'variety': 36,
         'become': 27,
         'reduce': 8,
         'dus': 8,
         'explore': 16,
         'purpose': 19,
         'association': 33,
         'per': 27,
         'advertise': 2,
         'community': 20,
         'german': 36,
         'und': 9,
         'integration': 12,
         'im': 10,
         'near': 8,
         'sheffield': 10,
         'die': 12,
         'sense': 20,
         'college': 29,
         'east': 9,
         'explanation': 12,
         'ltd': 6,
         'trade': 2,
         'side': 12,
         'answer': 21,
         'essential': 4,
         'mail': 81,
         'innovative': 3,
         'material': 34,
         'sery': 10,
         'daily': 10,
         'canadian': 9,
         'moment': 2,
         'forthcome': 7,
         'record': 13,
         'classroom': 12,
         'genre': 9,
         'contrast': 15,
         'local': 20,
         'welcome': 28,
         'ave': 4,
         'increase': 12,
         'gold': 1,
         'co': 18,
         'textbook': 8,
         'every': 12,
         'unite': 17,
         'dr': 35,
         'contribute': 17,
         'latest': 8,
         'making': 1,
         'australium': 6,
         'describe': 21,
         'whole': 21,
         'grammatical': 32,
         'especially': 31,
         'discount': 1,
         'blvd': 2,
         'client': 2,
         'prof': 20,
         'assist': 8,
         'level': 37,
         'distribute': 10,
         'pb': 10,
         'peter': 30,
         'plus': 17,
         'kind': 29,
         'master': 7,
         'seller': 1,
         'beyond': 13,
         'mode': 9,
         'directly': 25,
         'difference': 18,
         'proposal': 23,
         'pl': 6,
         'across': 17,
         'charge': 10,
         'base': 57,
         'se': 9,
         'member': 22,
         'usage': 17,
         'pour': 6,
         'organisation': 6,
         'tutorial': 13,
         'universite': 11,
         'hour': 13,
         'respect': 13,
         'dan': 13,
         'forward': 15,
         'fr': 14,
         'edition': 7,
         'relevance': 15,
         'parallel': 13,
         'du': 9,
         'preference': 14,
         'marie': 13,
         'la': 19,
         'organiser': 16,
         'consider': 47,
         'leave': 18,
         'half': 9,
         'summary': 35,
         'hold': 63,
         'mailto': 1,
         'en': 10,
         'president': 6,
         'organise': 12,
         'team': 3,
         'institut': 13,
         'exact': 4,
         'dictionary': 20,
         'many': 56,
         'japanese': 26,
         'behalf': 6,
         'reply': 12,
         'graduate': 30,
         'enough': 11,
         'after': 40,
         'hbe': 6,
         'lot': 14,
         'former': 5,
         'phrase': 26,
         'unfortunately': 3,
         'integrate': 21,
         'japan': 20,
         'jp': 14,
         'edit': 10,
         'dear': 14,
         'link': 22,
         'here': 40,
         'recently': 25,
         't': 33,
         'doubt': 5,
         'while': 33,
         'participant': 38,
         'alway': 16,
         'upon': 12,
         'almost': 4,
         'off': 7,
         'thousand': 2,
         'means': 20,
         'themselve': 15,
         'count': 8,
         'access': 22,
         'occur': 11,
         'sentence': 25,
         'indo': 10,
         'borrow': 6,
         'suggest': 22,
         'effect': 30,
         'bilingual': 10,
         'back': 15,
         'necessarily': 10,
         'surface': 17,
         'lexicon': 22,
         'maintain': 6,
         'refer': 20,
         'w': 36,
         'same': 41,
         'multilingual': 19,
         'dialect': 20,
         'voice': 8,
         'another': 28,
         'morpheme': 10,
         'anthropology': 11,
         'comparative': 27,
         'account': 41,
         'noun': 20,
         'urge': 4,
         'york': 35,
         'fill': 13,
         'recognize': 5,
         'influence': 14,
         'easiest': 1,
         'boundary': 6,
         'bibliography': 15,
         'feel': 11,
         'article': 32,
         'item': 18,
         'previously': 14,
         'draw': 11,
         'component': 8,
         'accord': 17,
         'hardly': 9,
         'whether': 38,
         'literary': 9,
         'example': 61,
         'identical': 11,
         'central': 18,
         'constituent': 9,
         'attach': 7,
         'generative': 15,
         'subscribe': 7,
         'subscription': 7,
         'dissertation': 15,
         'unpublish': 16,
         'notice': 19,
         'report': 30,
         'line': 21,
         'appear': 33,
         'newsletter': 3,
         'spring': 4,
         'max': 14,
         'volumes': 5,
         'actual': 6,
         'al': 10,
         'journal': 27,
         'onto': 2,
         'collection': 12,
         'cd': 4,
         'user': 16,
         'amount': 15,
         'institution': 13,
         'believe': 15,
         'ago': 17,
         'full': 41,
         'deliver': 3,
         'multus': 8,
         'move': 18,
         'image': 12,
         'produce': 15,
         'media': 9,
         'medical': 5,
         'among': 28,
         'greater': 6,
         'open': 41,
         'outstand': 4,
         'down': 5,
         'play': 16,
         'thing': 20,
         ...})

For every common word $w$ we compute the probability that $w$ occurs in a spam email and the probability that it occurs in a ham email. The formula for spam is:
$$ P(w \in\texttt{Spam}) = \frac{\mbox{number of spam emails containing $w$}}{\mbox{number of all spam emails}} $$
The formula for ham is analogous:
$$ P(w \in\texttt{Ham}) = \frac{\mbox{number of ham emails containing $w$}}{\mbox{number of all ham emails}} $$

However, if we used these formulas directly, then a common word $w$ that, for some reason, has not yet occurred in any spam email would be assigned a probability of $0$ of occurring in a spam email. Hence, our classifier would never classify an email containing the word $w$ as spam. As this cannot be right, we assume that there is one further spam email that contains every common word. This Laplace smoothing assumption changes the formula for $P(w \in\texttt{Spam})$ as follows:
$$ P(w \in\texttt{Spam}) = \frac{\mbox{number of spam emails containing $w$} + 1}{\mbox{number of all spam emails} + 1} $$
The formula for $P(w \in\texttt{Ham})$ is adjusted in the same way.
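To make the effect of the smoothing concrete, here is a minimal sketch on toy data. The helper `word_probabilities` and the names `toy_counter`, `toy_vocabulary`, and `toy_no_emails` are hypothetical and are not part of this notebook. The sketch assumes that `counter[w]` holds the number of emails of one class that contain `w`.

```python
from collections import Counter

def word_probabilities(counter, vocabulary, n_emails, smooth=True):
    """Estimate P(w in class) for every word in vocabulary.

    counter[w] is assumed to be the number of emails of the class containing w.
    With smooth=True, one extra email containing every word is assumed.
    """
    extra = 1 if smooth else 0
    return {w: (counter[w] + extra) / (n_emails + extra) for w in vocabulary}

# Hypothetical toy data, not taken from the EmailData corpus.
toy_counter    = Counter({'credit': 3, 'free': 2})
toy_vocabulary = ['credit', 'free', 'syntax']
toy_no_emails  = 4

raw      = word_probabilities(toy_counter, toy_vocabulary, toy_no_emails, smooth=False)
smoothed = word_probabilities(toy_counter, toy_vocabulary, toy_no_emails)

print(raw['syntax'])       # 0.0: the unseen word would rule out the class entirely
print(smoothed['syntax'])  # 0.2: (0 + 1) / (4 + 1)
```

With the real data, and assuming `no_spam` is 350, a common word that never occurs in spam receives the probability $1/351 \approx 0.00285$ instead of $0$.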


In [16]:
# Smoothed probabilities: one extra email containing every common word is assumed.
Spam_Probability = {}
Ham__Probability = {}
for w in Common_Words:
    Spam_Probability[w] = (Spam_Counter[w] + 1) / (no_spam + 1)
    Ham__Probability[w] = (Ham__Counter[w] + 1) / (no_ham  + 1)
Spam_Probability


Out[16]:
{'load': 0.037037037037037035,
 'comprise': 0.011396011396011397,
 'familiar': 0.022792022792022793,
 'teen': 0.045584045584045586,
 'massive': 0.02564102564102564,
 'gamble': 0.04843304843304843,
 'none': 0.019943019943019943,
 'implementation': 0.002849002849002849,
 'majority': 0.05128205128205128,
 'cgibin': 0.02564102564102564,
 'rejection': 0.005698005698005698,
 'smaller': 0.002849002849002849,
 'launch': 0.03418803418803419,
 'lee': 0.011396011396011397,
 'database': 0.07977207977207977,
 'food': 0.022792022792022793,
 'window': 0.09686609686609686,
 'transfer': 0.02564102564102564,
 'candidate': 0.017094017094017096,
 'delay': 0.02849002849002849,
 'frank': 0.03418803418803419,
 'multus': 0.05128205128205128,
 'late': 0.039886039886039885,
 'engage': 0.011396011396011397,
 'work': 0.35327635327635326,
 'government': 0.039886039886039885,
 'newest': 0.03133903133903134,
 'call': 0.3789173789173789,
 'vulgarity': 0.022792022792022793,
 'material': 0.07122507122507123,
 'organisation': 0.008547008547008548,
 'affect': 0.008547008547008548,
 'hundreds': 0.045584045584045586,
 'propose': 0.02849002849002849,
 'john': 0.042735042735042736,
 'campus': 0.002849002849002849,
 'competition': 0.07407407407407407,
 'view': 0.05128205128205128,
 'penny': 0.05128205128205128,
 'currency': 0.03133903133903134,
 'gender': 0.002849002849002849,
 'class': 0.07692307692307693,
 'santa': 0.014245014245014245,
 'andrew': 0.005698005698005698,
 'bonus': 0.11965811965811966,
 'refinance': 0.037037037037037035,
 'organization': 0.06837606837606838,
 'eric': 0.008547008547008548,
 'site': 0.3475783475783476,
 'ongo': 0.014245014245014245,
 'italy': 0.022792022792022793,
 'sprachwissenschaft': 0.002849002849002849,
 'operator': 0.017094017094017096,
 'little': 0.16809116809116809,
 'appear': 0.05698005698005698,
 'ms': 0.02849002849002849,
 'actually': 0.07122507122507123,
 'perform': 0.03418803418803419,
 'monthly': 0.08262108262108261,
 'opposite': 0.008547008547008548,
 'latex': 0.005698005698005698,
 'job': 0.1396011396011396,
 'forum': 0.017094017094017096,
 'correct': 0.042735042735042736,
 'install': 0.02849002849002849,
 'miss': 0.1225071225071225,
 'local': 0.07407407407407407,
 'remain': 0.03418803418803419,
 'chain': 0.042735042735042736,
 'music': 0.042735042735042736,
 'ready': 0.1168091168091168,
 'hundr': 0.02849002849002849,
 'dinner': 0.008547008547008548,
 'bill': 0.15669515669515668,
 'singapore': 0.002849002849002849,
 'option': 0.05698005698005698,
 'multimedium': 0.008547008547008548,
 'dialect': 0.002849002849002849,
 'translation': 0.002849002849002849,
 'most': 0.32193732193732194,
 'different': 0.1794871794871795,
 'literature': 0.005698005698005698,
 'unite': 0.09116809116809117,
 'sit': 0.06552706552706553,
 'sun': 0.02564102564102564,
 'desirous': 0.03133903133903134,
 'bear': 0.017094017094017096,
 'scientist': 0.005698005698005698,
 'income': 0.19943019943019943,
 'urge': 0.019943019943019943,
 'life': 0.18803418803418803,
 'extensive': 0.02564102564102564,
 'label': 0.02564102564102564,
 'city': 0.19943019943019943,
 'july': 0.019943019943019943,
 'mouse': 0.02849002849002849,
 'win': 0.25071225071225073,
 'continent': 0.017094017094017096,
 'die': 0.017094017094017096,
 'tom': 0.011396011396011397,
 'diploma': 0.022792022792022793,
 'edit': 0.022792022792022793,
 'fulfill': 0.03418803418803419,
 'sequence': 0.05128205128205128,
 'lucky': 0.05982905982905983,
 'less': 0.16524216524216523,
 'manufacturer': 0.02849002849002849,
 'implication': 0.002849002849002849,
 'colingacl': 0.002849002849002849,
 'global': 0.05128205128205128,
 'referral': 0.02564102564102564,
 'western': 0.011396011396011397,
 'using': 0.07977207977207977,
 'map': 0.005698005698005698,
 'great': 0.2336182336182336,
 'response': 0.1282051282051282,
 'wrong': 0.039886039886039885,
 'bind': 0.017094017094017096,
 'analyse': 0.005698005698005698,
 'busy': 0.017094017094017096,
 'pb': 0.002849002849002849,
 'commerce': 0.022792022792022793,
 'ext': 0.045584045584045586,
 'polish': 0.005698005698005698,
 'worldwide': 0.07692307692307693,
 'previously': 0.04843304843304843,
 'peter': 0.014245014245014245,
 'studies': 0.002849002849002849,
 'appropriate': 0.017094017094017096,
 'mailbox': 0.07122507122507123,
 'again': 0.22507122507122507,
 'partner': 0.07977207977207977,
 'truly': 0.08831908831908832,
 'catch': 0.04843304843304843,
 'develop': 0.042735042735042736,
 'meeting': 0.019943019943019943,
 'title': 0.09686609686609686,
 'cfp': 0.002849002849002849,
 'parttime': 0.02849002849002849,
 'begin': 0.09401709401709402,
 'dozen': 0.017094017094017096,
 'addition': 0.037037037037037035,
 'artificial': 0.019943019943019943,
 'experiment': 0.008547008547008548,
 'mci': 0.037037037037037035,
 'around': 0.09686609686609686,
 'alexis': 0.002849002849002849,
 'april': 0.022792022792022793,
 'dictionary': 0.002849002849002849,
 'receive': 0.45014245014245013,
 'internet': 0.3561253561253561,
 'exercise': 0.014245014245014245,
 'edinburgh': 0.002849002849002849,
 'oversea': 0.022792022792022793,
 'paper': 0.09971509971509972,
 'charge': 0.1396011396011396,
 'j': 0.03418803418803419,
 'store': 0.07977207977207977,
 'nice': 0.05982905982905983,
 'raleigh': 0.037037037037037035,
 'six': 0.07977207977207977,
 'style': 0.014245014245014245,
 'www': 0.3162393162393162,
 'member': 0.15669515669515668,
 'activity': 0.037037037037037035,
 'y': 0.014245014245014245,
 'onetime': 0.03133903133903134,
 'comment': 0.03418803418803419,
 'status': 0.014245014245014245,
 'means': 0.05413105413105413,
 'reap': 0.03418803418803419,
 'chat': 0.04843304843304843,
 'utility': 0.037037037037037035,
 'usage': 0.011396011396011397,
 'beautiful': 0.05982905982905983,
 'extractor': 0.03418803418803419,
 'hotmail': 0.037037037037037035,
 'phrase': 0.011396011396011397,
 'doctor': 0.02849002849002849,
 'generally': 0.005698005698005698,
 'highly': 0.05698005698005698,
 'helpful': 0.014245014245014245,
 'access': 0.13105413105413105,
 'red': 0.02564102564102564,
 'ac': 0.019943019943019943,
 'usa': 0.09116809116809117,
 'structural': 0.002849002849002849,
 'likewise': 0.014245014245014245,
 'major': 0.16524216524216523,
 'wherea': 0.002849002849002849,
 'acl': 0.002849002849002849,
 'canadian': 0.019943019943019943,
 'cds': 0.03133903133903134,
 'mastercard': 0.10256410256410256,
 'programme': 0.002849002849002849,
 'living': 0.02849002849002849,
 'ps': 0.02564102564102564,
 'goe': 0.019943019943019943,
 'design': 0.09971509971509972,
 'end': 0.13675213675213677,
 'foot': 0.03133903133903134,
 'postscript': 0.002849002849002849,
 'empirical': 0.002849002849002849,
 'color': 0.042735042735042736,
 'corner': 0.02564102564102564,
 'unsubscribe': 0.05982905982905983,
 'change': 0.17094017094017094,
 'deal': 0.09971509971509972,
 'substantial': 0.05698005698005698,
 'planet': 0.02849002849002849,
 'michigan': 0.002849002849002849,
 'dollars': 0.05698005698005698,
 'simon': 0.005698005698005698,
 'trip': 0.042735042735042736,
 'award': 0.039886039886039885,
 'credit': 0.20512820512820512,
 'distinguish': 0.005698005698005698,
 'quantifier': 0.002849002849002849,
 'registration': 0.02849002849002849,
 'grateful': 0.005698005698005698,
 'doe': 0.02849002849002849,
 'hesitate': 0.07407407407407407,
 'psycholinguistic': 0.002849002849002849,
 'robert': 0.014245014245014245,
 'publication': 0.037037037037037035,
 'typical': 0.022792022792022793,
 'demo': 0.042735042735042736,
 'log': 0.03418803418803419,
 'e': 0.25071225071225073,
 'interest': 0.28205128205128205,
 'm': 0.22507122507122507,
 'tremendous': 0.03418803418803419,
 'simple': 0.20797720797720798,
 'phonological': 0.002849002849002849,
 'excellent': 0.07122507122507123,
 'enjoy': 0.10541310541310542,
 'genie': 0.02849002849002849,
 'preview': 0.037037037037037035,
 'december': 0.037037037037037035,
 'impossible': 0.011396011396011397,
 'america': 0.08547008547008547,
 'webmaster': 0.037037037037037035,
 'dori': 0.02849002849002849,
 'le': 0.011396011396011397,
 'de': 0.014245014245014245,
 'please': 0.5384615384615384,
 'reality': 0.02564102564102564,
 'heart': 0.02849002849002849,
 'weeks': 0.03418803418803419,
 'parse': 0.002849002849002849,
 'meet': 0.09116809116809117,
 'russian': 0.005698005698005698,
 'dear': 0.10826210826210826,
 'netherland': 0.008547008547008548,
 'speak': 0.019943019943019943,
 'floor': 0.011396011396011397,
 'blvd': 0.037037037037037035,
 'entirely': 0.014245014245014245,
 'clearance': 0.03133903133903134,
 'practical': 0.011396011396011397,
 'contents': 0.019943019943019943,
 'medium': 0.02564102564102564,
 'lay': 0.045584045584045586,
 'somewhat': 0.019943019943019943,
 'surely': 0.04843304843304843,
 'accompany': 0.002849002849002849,
 'buy': 0.22792022792022792,
 'judgment': 0.03133903133903134,
 'orders': 0.06837606837606838,
 'fairchild': 0.03133903133903134,
 'zero': 0.019943019943019943,
 'focus': 0.017094017094017096,
 'spout': 0.03133903133903134,
 'wish': 0.245014245014245,
 'translate': 0.008547008547008548,
 'vendor': 0.03133903133903134,
 'faith': 0.042735042735042736,
 'sincerely': 0.10256410256410256,
 'mixe': 0.011396011396011397,
 'draw': 0.019943019943019943,
 'typology': 0.002849002849002849,
 'joan': 0.002849002849002849,
 'participation': 0.02849002849002849,
 'quote': 0.03133903133903134,
 'integrate': 0.005698005698005698,
 'recently': 0.06267806267806268,
 'vocabulary': 0.002849002849002849,
 'interaction': 0.002849002849002849,
 'kit': 0.03133903133903134,
 'cycle': 0.02564102564102564,
 'session': 0.011396011396011397,
 'august': 0.005698005698005698,
 'sometime': 0.05128205128205128,
 'non': 0.019943019943019943,
 'volumes': 0.017094017094017096,
 'anderson': 0.008547008547008548,
 'anna': 0.002849002849002849,
 'discussion': 0.011396011396011397,
 'diskette': 0.02564102564102564,
 'finding': 0.014245014245014245,
 'entire': 0.08547008547008547,
 'fl': 0.037037037037037035,
 'tip': 0.05698005698005698,
 'player': 0.03133903133903134,
 'traditional': 0.05413105413105413,
 'lie': 0.017094017094017096,
 'opportunity': 0.21082621082621084,
 'ltd': 0.008547008547008548,
 'shop': 0.07692307692307693,
 'front': 0.05698005698005698,
 'reread': 0.02849002849002849,
 'alway': 0.16524216524216523,
 'condition': 0.019943019943019943,
 'band': 0.02564102564102564,
 'sex': 0.06552706552706553,
 'eye': 0.04843304843304843,
 'proper': 0.022792022792022793,
 'century': 0.03133903133903134,
 'avenue': 0.03418803418803419,
 'buck': 0.022792022792022793,
 'motivation': 0.002849002849002849,
 'postfach': 0.002849002849002849,
 'macintosh': 0.017094017094017096,
 'ignore': 0.06552706552706553,
 'inquiry': 0.04843304843304843,
 'vary': 0.02564102564102564,
 'md': 0.04843304843304843,
 'poor': 0.03418803418803419,
 'move': 0.1111111111111111,
 'top': 0.1452991452991453,
 'capture': 0.017094017094017096,
 'european': 0.008547008547008548,
 'harri': 0.002849002849002849,
 'israel': 0.008547008547008548,
 'modify': 0.03133903133903134,
 'eventually': 0.03418803418803419,
 'conversation': 0.019943019943019943,
 'assessment': 0.008547008547008548,
 'produce': 0.09686609686609686,
 'coverage': 0.011396011396011397,
 'media': 0.011396011396011397,
 'click': 0.28774928774928776,
 'indo': 0.002849002849002849,
 'ma': 0.039886039886039885,
 'female': 0.017094017094017096,
 'genuine': 0.022792022792022793,
 'typological': 0.002849002849002849,
 'interdisciplinary': 0.005698005698005698,
 'predicate': 0.002849002849002849,
 'banner': 0.02564102564102564,
 'provide': 0.20512820512820512,
 'back': 0.245014245014245,
 'generate': 0.12535612535612536,
 'independent': 0.06267806267806268,
 'june': 0.03133903133903134,
 'monday': 0.02849002849002849,
 'individual': 0.07407407407407407,
 'speed': 0.05982905982905983,
 'dan': 0.005698005698005698,
 'thing': 0.16524216524216523,
 'demand': 0.02564102564102564,
 'wealth': 0.03133903133903134,
 'value': 0.08831908831908832,
 'programs': 0.045584045584045586,
 'texa': 0.017094017094017096,
 'nt': 0.3646723646723647,
 'national': 0.039886039886039885,
 'introduction': 0.014245014245014245,
 'amazing': 0.05698005698005698,
 'hr': 0.045584045584045586,
 'intelligent': 0.005698005698005698,
 'request': 0.23076923076923078,
 'surface': 0.008547008547008548,
 'classified': 0.022792022792022793,
 'policy': 0.017094017094017096,
 'mediumsize': 0.02564102564102564,
 'plain': 0.042735042735042736,
 'hit': 0.1225071225071225,
 'im': 0.008547008547008548,
 'commonly': 0.017094017094017096,
 'star': 0.04843304843304843,
 'guy': 0.045584045584045586,
 'symposium': 0.002849002849002849,
 'w': 0.05413105413105413,
 'september': 0.017094017094017096,
 'forever': 0.05413105413105413,
 'notify': 0.02564102564102564,
 'rest': 0.08262108262108261,
 'repeat': 0.019943019943019943,
 'martin': 0.011396011396011397,
 'description': 0.014245014245014245,
 'undoubtedly': 0.022792022792022793,
 'myself': 0.06552706552706553,
 'zip': 0.150997150997151,
 'kong': 0.005698005698005698,
 'happen': 0.10256410256410256,
 'direct': 0.10541310541310542,
 'percentage': 0.03418803418803419,
 'grammar': 0.002849002849002849,
 'delivery': 0.09686609686609686,
 'reply': 0.2564102564102564,
 'du': 0.002849002849002849,
 'weekend': 0.037037037037037035,
 'although': 0.039886039886039885,
 'french': 0.005698005698005698,
 'rich': 0.07122507122507123,
 'http': 0.45014245014245013,
 'stay': 0.06267806267806268,
 'scheme': 0.03133903133903134,
 'conceptual': 0.005698005698005698,
 'deep': 0.02564102564102564,
 'subject': 0.2934472934472934,
 'abuse': 0.03133903133903134,
 'consult': 0.008547008547008548,
 'below': 0.2849002849002849,
 'fastest': 0.037037037037037035,
 'organiser': 0.002849002849002849,
 'comparable': 0.011396011396011397,
 'currently': 0.07122507122507123,
 'influence': 0.005698005698005698,
 'suppose': 0.03418803418803419,
 'htm': 0.07122507122507123,
 'enhance': 0.014245014245014245,
 'video': 0.11396011396011396,
 'jone': 0.008547008547008548,
 'document': 0.03418803418803419,
 'mb': 0.02564102564102564,
 'consideration': 0.02564102564102564,
 'apology': 0.02849002849002849,
 'iro': 0.002849002849002849,
 'michael': 0.019943019943019943,
 'result': 0.1623931623931624,
 'bid': 0.05413105413105413,
 'until': 0.1282051282051282,
 'excess': 0.05413105413105413,
 'put': 0.2222222222222222,
 'exclude': 0.014245014245014245,
 'hi': 0.07122507122507123,
 'unlimit': 0.05982905982905983,
 'bring': 0.10256410256410256,
 'explore': 0.008547008547008548,
 'trend': 0.011396011396011397,
 'try': 0.21652421652421652,
 'numbers': 0.03133903133903134,
 'expect': 0.07122507122507123,
 'organise': 0.002849002849002849,
 'ed': 0.017094017094017096,
 'history': 0.039886039886039885,
 'favorite': 0.03133903133903134,
 'due': 0.07692307692307693,
 'nc': 0.042735042735042736,
 'promptly': 0.02849002849002849,
 'yours': 0.15669515669515668,
 'san': 0.022792022792022793,
 'reports': 0.06267806267806268,
 'documentation': 0.014245014245014245,
 'initially': 0.042735042735042736,
 'near': 0.045584045584045586,
 'birth': 0.019943019943019943,
 'park': 0.04843304843304843,
 'trash': 0.039886039886039885,
 'firm': 0.02564102564102564,
 'confident': 0.037037037037037035,
 'virtually': 0.042735042735042736,
 'syntactic': 0.002849002849002849,
 'evergrow': 0.02849002849002849,
 'quit': 0.05413105413105413,
 'register': 0.07977207977207977,
 'general': 0.039886039886039885,
 'stun': 0.03133903133903134,
 'benefit': 0.07977207977207977,
 'quality': 0.07407407407407407,
 'spain': 0.002849002849002849,
 'teacher': 0.005698005698005698,
 'research': 0.07692307692307693,
 'pic': 0.017094017094017096,
 'participant': 0.039886039886039885,
 'evaluation': 0.014245014245014245,
 'responsible': 0.03133903133903134,
 'perceive': 0.002849002849002849,
 'institution': 0.008547008547008548,
 'bernard': 0.002849002849002849,
 'off': 0.2022792022792023,
 'cooperation': 0.005698005698005698,
 'marketing': 0.08547008547008547,
 'north': 0.022792022792022793,
 'illustrate': 0.002849002849002849,
 'hardcore': 0.02849002849002849,
 'yield': 0.019943019943019943,
 'newsgroup': 0.03133903133903134,
 'fantastic': 0.07977207977207977,
 'science': 0.008547008547008548,
 'beach': 0.05982905982905983,
 'innovative': 0.014245014245014245,
 'treat': 0.06552706552706553,
 'signal': 0.011396011396011397,
 'run': 0.1396011396011396,
 'exactly': 0.1111111111111111,
 'brand': 0.04843304843304843,
 'pennsylvanium': 0.002849002849002849,
 'wait': 0.14814814814814814,
 'contact': 0.19373219373219372,
 'minute': 0.11965811965811966,
 'totally': 0.05982905982905983,
 'statistics': 0.042735042735042736,
 'state': 0.2962962962962963,
 'amateur': 0.042735042735042736,
 'researcher': 0.005698005698005698,
 'clean': 0.05698005698005698,
 'cluster': 0.002849002849002849,
 'open': 0.13105413105413105,
 'reconstruction': 0.002849002849002849,
 'chri': 0.014245014245014245,
 'perfectly': 0.05982905982905983,
 'help': 0.2564102564102564,
 'completely': 0.10541310541310542,
 'operate': 0.06552706552706553,
 'loss': 0.039886039886039885,
 'watch': 0.10541310541310542,
 'approve': 0.03133903133903134,
 'someone': 0.1396011396011396,
 'arise': 0.005698005698005698,
 'scope': 0.005698005698005698,
 'development': 0.008547008547008548,
 'unless': 0.05413105413105413,
 'version': 0.045584045584045586,
 'necessarily': 0.002849002849002849,
 'edition': 0.019943019943019943,
 'dutch': 0.002849002849002849,
 'novel': 0.005698005698005698,
 'trial': 0.06837606837606838,
 'juno': 0.03418803418803419,
 'fast': 0.07977207977207977,
 'millions': 0.045584045584045586,
 'twenty': 0.019943019943019943,
 'dramatically': 0.03418803418803419,
 'anywhere': 0.14245014245014245,
 'original': 0.05698005698005698,
 'acceptance': 0.017094017094017096,
 'downsize': 0.03133903133903134,
 'today': 0.3333333333333333,
 'exact': 0.05128205128205128,
 'weekly': 0.07692307692307693,
 'keynote': 0.002849002849002849,
 'forget': 0.07977207977207977,
 'characteristic': 0.005698005698005698,
 'txt': 0.03418803418803419,
 'even': 0.2849002849002849,
 'increase': 0.11396011396011396,
 'clear': 0.045584045584045586,
 'advertiser': 0.042735042735042736,
 'purchase': 0.21082621082621084,
 'date': 0.15384615384615385,
 'integration': 0.005698005698005698,
 'conversational': 0.002849002849002849,
 'adult': 0.1225071225071225,
 'classic': 0.011396011396011397,
 'plan': 0.11965811965811966,
 'earth': 0.06267806267806268,
 'bottom': 0.06837606837606838,
 'associate': 0.06552706552706553,
 'sales': 0.06552706552706553,
 'south': 0.045584045584045586,
 'comprehensive': 0.019943019943019943,
 'making': 0.03418803418803419,
 'transcription': 0.002849002849002849,
 'easily': 0.1168091168091168,
 'finger': 0.03418803418803419,
 'la': 0.02564102564102564,
 'mortgage': 0.05128205128205128,
 'add': 0.22792022792022792,
 'conceal': 0.037037037037037035,
 'verbal': 0.005698005698005698,
 'underlie': 0.002849002849002849,
 'long': 0.11396011396011396,
 'import': 0.02564102564102564,
 'analysis': 0.014245014245014245,
 'tool': 0.10541310541310542,
 'application': 0.045584045584045586,
 'perhap': 0.014245014245014245,
 'snail': 0.03133903133903134,
 'review': 0.06837606837606838,
 'easiest': 0.04843304843304843,
 'extremely': 0.07407407407407407,
 'though': 0.05698005698005698,
 'verify': 0.045584045584045586,
 'leave': 0.16524216524216523,
 'virtual': 0.03133903133903134,
 'dynamic': 0.011396011396011397,
 'recipient': 0.039886039886039885,
 'couple': 0.06267806267806268,
 'least': 0.1282051282051282,
 'germany': 0.019943019943019943,
 'useful': 0.03133903133903134,
 'client': 0.05413105413105413,
 'television': 0.042735042735042736,
 'prof': 0.008547008547008548,
 'scott': 0.005698005698005698,
 'phd': 0.019943019943019943,
 'importance': 0.008547008547008548,
 'gb': 0.011396011396011397,
 'korean': 0.002849002849002849,
 'order': 0.3732193732193732,
 'variation': 0.002849002849002849,
 'aim': 0.008547008547008548,
 'trust': 0.039886039886039885,
 'thoma': 0.017094017094017096,
 'corporations': 0.04843304843304843,
 'musical': 0.017094017094017096,
 'much': 0.2934472934472934,
 'lifetime': 0.04843304843304843,
 'intrusion': 0.045584045584045586,
 'set': 0.12535612535612536,
 'iii': 0.014245014245014245,
 'potential': 0.13105413105413105,
 'concept': 0.017094017094017096,
 'country': 0.11965811965811966,
 'hotel': 0.022792022792022793,
 'school': 0.04843304843304843,
 'master': 0.03418803418803419,
 'acquire': 0.06552706552706553,
 'laugh': 0.037037037037037035,
 'mo': 0.02564102564102564,
 'psychological': 0.002849002849002849,
 'club': 0.05413105413105413,
 'obviously': 0.06837606837606838,
 'debt': 0.08831908831908832,
 'extract': 0.045584045584045586,
 'vision': 0.011396011396011397,
 'million': 0.245014245014245,
 'acquisition': 0.008547008547008548,
 'object': 0.008547008547008548,
 'human': 0.022792022792022793,
 'sake': 0.037037037037037035,
 'include': 0.37037037037037035,
 't': 0.21082621082621084,
 'success': 0.1737891737891738,
 'theoretical': 0.002849002849002849,
 'community': 0.022792022792022793,
 'vacation': 0.09116809116809117,
 'sample': 0.05413105413105413,
 'wh': 0.005698005698005698,
 'five': 0.04843304843304843,
 'finance': 0.042735042735042736,
 'search': 0.1737891737891738,
 'datum': 0.02849002849002849,
 'engine': 0.09971509971509972,
 'while': 0.1339031339031339,
 'everybody': 0.022792022792022793,
 'february': 0.011396011396011397,
 'author': 0.022792022792022793,
 'editor': 0.005698005698005698,
 'summarize': 0.008547008547008548,
 'fundamental': 0.002849002849002849,
 'publish': 0.06837606837606838,
 'blackwell': 0.002849002849002849,
 'yourself': 0.20797720797720798,
 'hello': 0.10541310541310542,
 'investigation': 0.017094017094017096,
 'universal': 0.011396011396011397,
 'karen': 0.022792022792022793,
 'jump': 0.039886039886039885,
 'umontreal': 0.002849002849002849,
 'follow': 0.33903133903133903,
 'professional': 0.09686609686609686,
 'emerge': 0.005698005698005698,
 'once': 0.19658119658119658,
 'day': 0.4415954415954416,
 'jame': 0.008547008547008548,
 'slip': 0.037037037037037035,
 'medical': 0.02849002849002849,
 'borrow': 0.037037037037037035,
 'song': 0.019943019943019943,
 'idea': 0.08547008547008547,
 'hour': 0.2678062678062678,
 'argument': 0.002849002849002849,
 'obvious': 0.008547008547008548,
 'listing': 0.014245014245014245,
 'november': 0.008547008547008548,
 'subscription': 0.017094017094017096,
 'shift': 0.011396011396011397,
 'pleasure': 0.02564102564102564,
 'london': 0.03418803418803419,
 'however': 0.07977207977207977,
 'fun': 0.1282051282051282,
 'tree': 0.022792022792022793,
 'man': 0.05128205128205128,
 'classroom': 0.002849002849002849,
 'dates': 0.002849002849002849,
 'mt': 0.008547008547008548,
 'accommodation': 0.008547008547008548,
 'cm': 0.005698005698005698,
 'alternative': 0.011396011396011397,
 'send': 0.4415954415954416,
 'diversity': 0.002849002849002849,
 'framework': 0.002849002849002849,
 'bag': 0.022792022792022793,
 'cognition': 0.002849002849002849,
 'relationship': 0.008547008547008548,
 'reasonable': 0.022792022792022793,
 'att': 0.017094017094017096,
 'exceed': 0.019943019943019943,
 'von': 0.002849002849002849,
 'during': 0.02564102564102564,
 'region': 0.005698005698005698,
 'accessible': 0.037037037037037035,
 'linguistics': 0.005698005698005698,
 'refer': 0.02849002849002849,
 'observe': 0.005698005698005698,
 'britain': 0.002849002849002849,
 'creditor': 0.045584045584045586,
 'specify': 0.02564102564102564,
 'moneymake': 0.05982905982905983,
 'relevance': 0.002849002849002849,
 'honor': 0.017094017094017096,
 'connection': 0.07977207977207977,
 'news': 0.08831908831908832,
 'parameter': 0.005698005698005698,
 'twelve': 0.03418803418803419,
 'cv': 0.005698005698005698,
 'mike': 0.014245014245014245,
 'parallel': 0.005698005698005698,
 'apply': 0.05413105413105413,
 'is': 0.23931623931623933,
 'here': 0.4188034188034188,
 'short': 0.1111111111111111,
 'password': 0.02564102564102564,
 'mellon': 0.002849002849002849,
 'textbook': 0.002849002849002849,
 'telephone': 0.09116809116809117,
 'morn': 0.019943019943019943,
 'paragraph': 0.02849002849002849,
 'educational': 0.011396011396011397,
 'worth': 0.09401709401709402,
 'effective': 0.10826210826210826,
 'reflect': 0.005698005698005698,
 'normal': 0.011396011396011397,
 'compute': 0.011396011396011397,
 'ba': 0.02564102564102564,
 'pretty': 0.039886039886039885,
 'comparative': 0.002849002849002849,
 'contribute': 0.011396011396011397,
 'latest': 0.10541310541310542,
 'minimum': 0.03133903133903134,
 'interpretation': 0.002849002849002849,
 'australium': 0.011396011396011397,
 'drop': 0.07122507122507123,
 'tell': 0.20512820512820512,
 'fill': 0.15954415954415954,
 'richer': 0.042735042735042736,
 'compile': 0.02564102564102564,
 'residual': 0.03418803418803419,
 'convention': 0.008547008547008548,
 'belgium': 0.002849002849002849,
 'martha': 0.002849002849002849,
 'functional': 0.014245014245014245,
 'variety': 0.014245014245014245,
 'white': 0.008547008547008548,
 'spell': 0.005698005698005698,
 'interview': 0.05413105413105413,
 'isp': 0.05128205128205128,
 'cognitive': 0.002849002849002849,
 'slavic': 0.002849002849002849,
 'u': 0.1396011396011396,
 'possibility': 0.019943019943019943,
 'speech': 0.005698005698005698,
 'hire': 0.017094017094017096,
 'spokane': 0.03133903133903134,
 'japanese': 0.008547008547008548,
 'education': 0.03418803418803419,
 'unify': 0.002849002849002849,
 'distribute': 0.045584045584045586,
 'perception': 0.002849002849002849,
 'enquiry': 0.005698005698005698,
 'president': 0.022792022792022793,
 'poster': 0.005698005698005698,
 'african': 0.014245014245014245,
 'town': 0.03133903133903134,
 'overload': 0.039886039886039885,
 'everythe': 0.045584045584045586,
 'testimonial': 0.04843304843304843,
 'death': 0.03418803418803419,
 'christian': 0.005698005698005698,
 'purpose': 0.039886039886039885,
 'return': 0.18803418803418803,
 'team': 0.05982905982905983,
 'duration': 0.002849002849002849,
 'cinema': 0.02849002849002849,
 'lot': 0.150997150997151,
 'seem': 0.05413105413105413,
 'linguistic': 0.002849002849002849,
 'freedom': 0.10541310541310542,
 'v': 0.08262108262108261,
 'want': 0.41595441595441596,
 'cheap': 0.03418803418803419,
 'total': 0.13675213675213677,
 'dept': 0.022792022792022793,
 'literary': 0.002849002849002849,
 'second': 0.06837606837606838,
 'honest': 0.05128205128205128,
 'wife': 0.037037037037037035,
 'head': 0.037037037037037035,
 'agency': 0.05413105413105413,
 'final': 0.039886039886039885,
 'assume': 0.06837606837606838,
 'berlin': 0.002849002849002849,
 'alone': 0.07407407407407407,
 'compete': 0.019943019943019943,
 'three': 0.07407407407407407,
 'attach': 0.039886039886039885,
 'montreal': 0.002849002849002849,
 'movie': 0.05698005698005698,
 'consist': 0.011396011396011397,
 're': 0.29914529914529914,
 'incredible': 0.05413105413105413,
 'lack': 0.022792022792022793,
 'mistake': 0.022792022792022793,
 'satisfy': 0.05698005698005698,
 'fine': 0.03133903133903134,
 'middle': 0.037037037037037035,
 'ask': 0.19373219373219372,
 'capability': 0.017094017094017096,
 'guest': 0.011396011396011397,
 'discover': 0.09116809116809117,
 'resort': 0.02849002849002849,
 'amount': 0.17663817663817663,
 'define': 0.002849002849002849,
 'certainly': 0.03418803418803419,
 'problem': 0.13675213675213677,
 'organize': 0.008547008547008548,
 'believe': 0.1623931623931624,
 'resell': 0.06552706552706553,
 'survey': 0.017094017094017096,
 'whatsoever': 0.039886039886039885,
 'lisa': 0.008547008547008548,
 'cent': 0.05698005698005698,
 'largest': 0.05698005698005698,
 'requirement': 0.039886039886039885,
 'property': 0.03133903133903134,
 'msn': 0.02849002849002849,
 'mail': 0.5128205128205128,
 'loan': 0.05982905982905983,
 'domain': 0.06267806267806268,
 'released': 0.03133903133903134,
 'pour': 0.008547008547008548,
 'sum': 0.008547008547008548,
 'complete': 0.15669515669515668,
 'syntax': 0.002849002849002849,
 'symbol': 0.019943019943019943,
 'command': 0.019943019943019943,
 'sheffield': 0.002849002849002849,
 'interpret': 0.002849002849002849,
 'reviewer': 0.008547008547008548,
 'merciless': 0.03133903133903134,
 'nijmegen': 0.002849002849002849,
 'dupe': 0.02564102564102564,
 'industrial': 0.008547008547008548,
 'phonology': 0.002849002849002849,
 'create': 0.1623931623931624,
 'pittsburgh': 0.008547008547008548,
 'radio': 0.05128205128205128,
 'together': 0.05982905982905983,
 'sender': 0.06837606837606838,
 'privacy': 0.037037037037037035,
 'nature': 0.008547008547008548,
 'length': 0.011396011396011397,
 'contributor': 0.002849002849002849,
 'forthcome': 0.014245014245014245,
 'bet': 0.042735042735042736,
 'genre': 0.002849002849002849,
 'product': 0.28205128205128205,
 'expiration': 0.06267806267806268,
 'indiana': 0.002849002849002849,
 'nl': 0.002849002849002849,
 'obligation': 0.037037037037037035,
 'newsletter': 0.045584045584045586,
 'ii': 0.011396011396011397,
 'emailer': 0.03418803418803419,
 'exclusive': 0.06837606837606838,
 'note': 0.17094017094017094,
 'maintain': 0.03418803418803419,
 'assure': 0.02849002849002849,
 'rock': 0.022792022792022793,
 'quick': 0.08262108262108261,
 'retail': 0.03133903133903134,
 'copy': 0.18518518518518517,
 'x': 0.1339031339031339,
 'ad': 0.14245014245014245,
 'light': 0.014245014245014245,
 'serve': 0.02564102564102564,
 'toy': 0.02849002849002849,
 'press': 0.037037037037037035,
 'file': 0.15954415954415954,
 'days': 0.042735042735042736,
 'gold': 0.05982905982905983,
 'strength': 0.014245014245014245,
 'few': 0.19658119658119658,
 'eastern': 0.011396011396011397,
 'both': 0.13105413105413105,
 'dependency': 0.002849002849002849,
 'chomsky': 0.002849002849002849,
 'course': 0.1168091168091168,
 'morpheme': 0.002849002849002849,
 'financially': 0.04843304843304843,
 'device': 0.017094017094017096,
 'behind': 0.02849002849002849,
 'situation': 0.03133903133903134,
 'parent': 0.02564102564102564,
 'foundation': 0.017094017094017096,
 'bit': 0.05698005698005698,
 'orient': 0.008547008547008548,
 'stealth': 0.06552706552706553,
 'remember': 0.13105413105413105,
 'vium': 0.13105413105413105,
 'personally': 0.017094017094017096,
 'difficult': 0.037037037037037035,
 'money': 0.4017094017094017,
 'jp': 0.005698005698005698,
 'institute': 0.011396011396011397,
 'disc': 0.017094017094017096,
 'lyric': 0.017094017094017096,
 'correspondence': 0.008547008547008548,
 'cancel': 0.02564102564102564,
 'probably': 0.07407407407407407,
 'asset': 0.042735042735042736,
 'prompt': 0.04843304843304843,
 'equipment': 0.017094017094017096,
 'feature': 0.06552706552706553,
 'id': 0.06267806267806268,
 'conclude': 0.037037037037037035,
 'need': 0.41595441595441596,
 'sale': 0.16524216524216523,
 'truth': 0.02564102564102564,
 'automatically': 0.09116809116809117,
 'mix': 0.037037037037037035,
 'spend': 0.1396011396011396,
 'sorry': 0.06837606837606838,
 'index': 0.05698005698005698,
 'art': 0.03133903133903134,
 'fact': 0.12535612535612536,
 'department': 0.02564102564102564,
 'common': 0.042735042735042736,
 'estate': 0.039886039886039885,
 'publisher': 0.019943019943019943,
 'basically': 0.04843304843304843,
 'conjunction': 0.002849002849002849,
 'ram': 0.022792022792022793,
 'arrange': 0.014245014245014245,
 'whether': 0.05982905982905983,
 'example': 0.10256410256410256,
 'sure': 0.18803418803418803,
 'plenary': 0.002849002849002849,
 'love': 0.1111111111111111,
 'preparation': 0.014245014245014245,
 'actual': 0.02849002849002849,
 'cameraready': 0.002849002849002849,
 'cost': 0.2849002849002849,
 'point': 0.08831908831908832,
 'experience': 0.18233618233618235,
 'quickly': 0.08547008547008547,
 'thousands': 0.03133903133903134,
 'ultimate': 0.03133903133903134,
 'browser': 0.02564102564102564,
 'started': 0.03418803418803419,
 'reg': 0.03418803418803419,
 'coordinate': 0.002849002849002849,
 'rush': 0.042735042735042736,
 'network': 0.06552706552706553,
 'indefinite': 0.002849002849002849,
 'delete': 0.11396011396011396,
 'mailer': 0.045584045584045586,
 'package': 0.1396011396011396,
 'instead': 0.05128205128205128,
 'side': 0.011396011396011397,
 'already': 0.14814814814814814,
 'essential': 0.03133903133903134,
 'ever': 0.22507122507122507,
 'responsibility': 0.019943019943019943,
 'mit': 0.002849002849002849,
 'brief': 0.02564102564102564,
 'desire': 0.06267806267806268,
 'discovery': 0.019943019943019943,
 'royal': 0.011396011396011397,
 'update': 0.08262108262108261,
 'intelligence': 0.045584045584045586,
 'control': 0.08831908831908832,
 'present': 0.07977207977207977,
 'round': 0.011396011396011397,
 'sponsor': 0.022792022792022793,
 'occur': 0.017094017094017096,
 'philosophy': 0.005698005698005698,
 'current': 0.06267806267806268,
 'web': 0.2678062678062678,
 'text': 0.09116809116809117,
 'broadcast': 0.017094017094017096,
 'spot': 0.022792022792022793,
 'effect': 0.017094017094017096,
 'tradition': 0.002849002849002849,
 'tv': 0.05413105413105413,
 'germanic': 0.002849002849002849,
 ...}

According to our computation, the probability that a spam email contains the word 'consonant' is about $0.28\%$, while the probability that this word occurs in a ham email is about $2.56\%$.


In [17]:
Spam_Probability['consonant'], Ham__Probability['consonant']


Out[17]:
(0.002849002849002849, 0.02564102564102564)

For the word 'dollar', the probability that a spam email contains this word is about $21.1\%$, while the probability that it occurs in a ham email is only about $1.99\%$.


In [18]:
Spam_Probability['dollar'], Ham__Probability['dollar']


Out[18]:
(0.21082621082621084, 0.019943019943019943)
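
To see how strongly each of these two words points towards spam or ham, we can compare the two conditional probabilities directly. The following is a minimal sketch that hard-codes the four values printed above, so it runs without the state of this notebook. The dictionary names and the helper `likelihood_ratio` are assumptions made for illustration and are not used elsewhere in this notebook.

```python
import math

# Values copied from the outputs Out[17] and Out[18] above.
spam_probability_of = {'consonant': 0.002849002849002849,
                       'dollar':    0.21082621082621084}
ham_probability_of  = {'consonant': 0.02564102564102564,
                       'dollar':    0.019943019943019943}

def likelihood_ratio(word):
    """Return P(word | spam) / P(word | ham); values above 1 favour spam."""
    return spam_probability_of[word] / ham_probability_of[word]

for word in ['consonant', 'dollar']:
    ratio = likelihood_ratio(word)
    print(f'{word}: ratio = {ratio:.3f}, log ratio = {math.log(ratio):+.3f}')
```

A ratio well below $1$, as for 'consonant', is evidence for ham, while a ratio well above $1$, as for 'dollar', is evidence for spam. The logarithm of the ratio is the amount that the presence of the word contributes to the difference of the two log-sums computed in the next section.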

Deciding whether an Email is Spam

Given a file name fn, the function spam_probability returns the probability that the message contained in the given file is spam.

When implementing the formula $$\arg\max\limits_{C \in \mathcal{C}} \left(\prod\limits_{i=1}^m P(f_i \;|\; C)\right) \cdot P(C) $$ we have to be careful, because a naive implementation will evaluate the product $$\prod\limits_{i=1}^m P(f_i \;|\; C)$$ as the number $0$ due to numerical underflow: the product consists of a large number of factors that are all smaller than $1$. The trick to compute this product is to remember that $$ \ln(a \cdot b) = \ln(a) + \ln(b) $$ and therefore transform the product into a sum of logarithms: $$ \prod\limits_{i=1}^m P(f_i \;|\; C) = \exp\left(\alpha + \sum\limits_{i=1}^m \ln\bigl(P(f_i \;|\; C)\bigr) \right) \cdot \exp(-\alpha)$$ Here, the constant $\alpha$ has to be chosen such that the application of the function exp to the value $$ \alpha + \sum\limits_{i=1}^m \ln\bigl(P(f_i \;|\; C)\bigr) $$ does not lead to an underflow error.

As we want to compute a probability, we have to be aware that the term $$ \left(\prod\limits_{i=1}^m P(f_i \;|\; C)\right) \cdot P(C) $$ is not the probability that the object is of class $C$, but is only proportional to this probability. Since the probability that an email is spam and the probability that it is ham must add up to $1$, we can compute the actual probabilities by normalizing, i.e. by dividing each of the two terms by their sum. The common factor $\exp(-\alpha)$ cancels in this division, so it never has to be evaluated.
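
The underflow and its remedy can be seen in isolation in the following sketch. It uses made-up numbers rather than the email data: $2000$ features that each have probability $0.01$ under a hypothetical class A and $0.0101$ under a hypothetical class B, with equal priors. The function name `normalized_posteriors` is an assumption for illustration. The shift by the largest log-score plays the role of the constant $\alpha$.

```python
import math

def normalized_posteriors(log_scores):
    """Turn unnormalized log-scores into probabilities that sum to 1."""
    shift   = max(log_scores)   # the largest term becomes exp(0) = 1
    weights = [math.exp(s - shift) for s in log_scores]
    total   = sum(weights)
    return [w / total for w in weights]

# Hypothetical example, not taken from the email data.
m = 2000
naive_product = 0.01 ** m            # far too small for a float: gives 0.0
log_a = m * math.log(0.01)           # log-score of class A
log_b = m * math.log(0.0101)         # log-score of class B
p_a, p_b = normalized_posteriors([log_a, log_b])
print(naive_product, p_a, p_b)
```

The naive product is $0$ for both classes, so the two classes could not be compared at all. In the shifted version, class B receives nearly all of the probability mass, while class A still receives a small but positive probability.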


In [19]:
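def spam_probability(fn):
    """Return the probability that the email stored in the file fn is spam."""
    log_p_spam = 0.0
    log_p_ham  = 0.0
    words = get_common_words(fn)
    # Sum the logarithms of the conditional probabilities instead of
    # multiplying the probabilities themselves in order to avoid underflow.
    for w in Common_Words:
        if w in words:
            log_p_spam += math.log(Spam_Probability[w])
            log_p_ham  += math.log(Ham__Probability[w])
        else:
            log_p_spam += math.log(1.0 - Spam_Probability[w])
            log_p_ham  += math.log(1.0 - Ham__Probability[w])
    # Both sums are non-positive.  Shifting them by alpha moves the larger
    # one to 0, so at least one of the two exponentials below equals 1.
    alpha  = abs(max(log_p_spam, log_p_ham))
    p_spam = math.exp(log_p_spam + alpha) * spam_prior
    p_ham  = math.exp(log_p_ham  + alpha) * ham__prior
    # Normalize so that the two probabilities add up to 1.
    return p_spam / (p_spam + p_ham)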

Let us test this with a ham email.


In [20]:
spam_probability('EmailData/ham-train/3-430msg1.txt')


Out[20]:
6.289803980920058e-29

The computed spam probability is practically $0$, so this email is correctly classified as ham. Let us now check the general performance.

Evaluate Precision and Recall

In order to evaluate the performance of this algorithm, we need to define two new concepts: precision and recall. Let us call the ham emails the positives, while the spam emails are called the negatives. Then we define

  • true positives: ham emails that are classified as ham,
  • false positives: spam emails that are classified as ham,
  • true negatives: spam emails that are classified as spam,
  • false negatives: ham emails that are classified as spam.

The precision of the spam classifier is then defined as $$ \texttt{precision} = \frac{\mbox{number of true positives}}{\mbox{number of true positives} + \mbox{number of false positives}} $$ Therefore, the precision measures the percentage of the ham emails in the set of all emails that are classified as ham. The recall of the spam classifier is defined as $$ \texttt{recall} = \frac{\mbox{number of true positives}}{\mbox{number of true positives} + \mbox{number of false negatives}} $$ Therefore, the recall measures the percentage of those ham emails that are indeed classified as ham.
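
As a small worked example of these two definitions, the following sketch computes precision and recall for six emails with made-up labels. The function `precision_and_recall` and the toy data are assumptions for illustration only and are independent of the classifier developed in this notebook.

```python
def precision_and_recall(actual, predicted, positive='ham'):
    """Compute precision and recall, treating `positive` as the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    return tp / (tp + fp), tp / (tp + fn)

# Hypothetical toy data: the true and the predicted labels of six emails.
actual    = ['ham', 'ham', 'ham', 'ham',  'spam', 'spam']
predicted = ['ham', 'ham', 'ham', 'spam', 'ham',  'spam']
precision, recall = precision_and_recall(actual, predicted)
print(precision, recall)  # 0.75 0.75
```

Here three ham emails are recognized as ham, one spam email is mistaken for ham, and one ham email is mistaken for spam. Hence both the precision and the recall are $3/4$.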

Usually, it is very important that the recall is high, as we don't want to lose a ham email because our classifier has incorrectly classified it as spam.
On the other hand, having a high precision is not that important. After all, if $10\%$ of the emails offered to us as ham are, in fact, spam, we might tolerate this. However, we would certainly not tolerate losing $10\%$ of our ham emails because they are incorrectly classified as spam.

The function precission_recall takes two directories as arguments: spam_dir is supposed to contain spam emails, while ham_dir contains ham emails. It computes the precision and the recall of our spam classifier with respect to these data. Additionally, it computes the accuracy, i.e. the fraction of all emails that are classified correctly.


In [21]:
def precission_recall(spam_dir, ham_dir):
    """Return precision, recall, and accuracy, where ham is the positive class."""
    TN = 0 # true negatives:  spam classified as spam
    FP = 0 # false positives: spam classified as ham
    for email in os.listdir(spam_dir):
        if spam_probability(os.path.join(spam_dir, email)) > 0.5:
            TN += 1
        else:
            FP += 1
    FN = 0 # false negatives: ham classified as spam
    TP = 0 # true positives:  ham classified as ham
    for email in os.listdir(ham_dir):
        if spam_probability(os.path.join(ham_dir, email)) > 0.5:
            FN += 1
        else:
            TP += 1
    precision = TP / (TP + FP)
    recall    = TP / (TP + FN)
    accuracy  = (TN + TP) / (TN + TP + FN + FP)
    return precision, recall, accuracy

In [22]:
precission_recall(spam_dir_train, ham__dir_train)


Out[22]:
(0.8495145631067961, 1.0, 0.9114285714285715)

In [23]:
precission_recall(spam_dir_test, ham__dir_test)


Out[23]:
(0.7791411042944786, 0.9769230769230769, 0.85)
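
The three numbers reported above can be translated back into counts of emails, which are often easier to interpret. The following sketch does this under the assumption that the directories contain exactly the number of emails stated at the beginning of this notebook, i.e. 350 spam and 350 ham emails for training and 130 spam and 130 ham emails for testing. The helper `confusion_counts` is hypothetical and not part of the classifier.

```python
def confusion_counts(precision, recall, n_spam, n_ham):
    """Recover (TP, FP, TN, FN) from precision, recall, and the class sizes."""
    TP = round(recall * n_ham)        # recall    = TP / n_ham
    FN = n_ham - TP
    FP = round(TP / precision) - TP   # precision = TP / (TP + FP)
    TN = n_spam - FP
    return TP, FP, TN, FN

# Values copied from Out[22] (training data) and Out[23] (test data).
train = confusion_counts(0.8495145631067961, 1.0, 350, 350)
test  = confusion_counts(0.7791411042944786, 0.9769230769230769, 130, 130)
print(train)  # (350, 62, 288, 0)
print(test)   # (127, 36, 94, 3)
```

Under this assumption, only 3 of the 130 ham emails in the test set are classified as spam, while 36 of the 130 spam emails are classified as ham. These counts reproduce the reported accuracies, since $(288 + 350) / 700 \approx 0.911$ and $(94 + 127) / 260 = 0.85$.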

In [ ]: